ABSTRACT

We describe Space Warps, a novel gravitational lens discovery service that yields 
samples of high purity and completeness through crowd-sourced visual inspection. 
Carefully produced colour composite images are displayed to volunteers via a
web-based classification interface, which records their estimates of the positions
of candidate lensed features. Images of simulated lenses, as well as real images
which lack lenses, are inserted into the image stream at random intervals; this
training set is used to give the volunteers instantaneous feedback on their
performance, as well as to calibrate a model of the system that provides dynamical
updates to the probability that a classified image contains a lens. Low probability
systems are retired from the site periodically, concentrating the sample towards a
set of lens candidates. Having divided 160
square degrees of Canada-France-Hawaii Telescope Legacy Survey (CFHTLS) imaging 
into some 430,000 overlapping 82 by 82 arcsecond tiles and displaying them on the 
site, we were joined by around 37,000 volunteers who contributed 11 million image 
classifications over the course of 8 months. This Stage 1 search reduced the sample 
to 3381 images containing candidates; these were then refined in Stage 2 to yield a 
sample that we expect to be over 90% complete and 30% pure, based on our analysis 
of the volunteers' performance on training images. We comment on the scalability of
the Space Warps system to the wide field survey era, based on our projection that 
searches of $10^5$ images could be performed by a crowd of $10^5$ volunteers in 6 days.

Key words: gravitational lensing: strong - methods: statistical - methods: citizen
science


1 INTRODUCTION

Strong gravitational lensing - the formation of multiple,
magnified images of background objects due to the deflection
of light by massive foreground objects - is a very powerful
astrophysical tool, enabling a wide range of science projects.
The image separations and distortions provide information
about the mass distribution in the lens (e.g. Auger et al.
2010b; Sonnenfeld et al. 2012; More et al. 2012; Sonnenfeld
et al. 2015), including on sub-galactic scales (e.g. Dalal &
Kochanek 2002; Vegetti et al. 2010; Hezaveh et al. 2013).


Any strong lens can provide magnification of a factor of
10 or more, providing a deeper, higher resolution view of
the distant universe through these “cosmic telescopes” (e.g.
Stark et al. 2008; Newton et al. 2011). Lensed quasars enable
cosmography via the time delays between the lightcurves of
multiple images (e.g. Tewes et al. 2013; Suyu et al. 2013),
and study of the accretion disk itself through the microlensing
effect (e.g. Poindexter et al. 2008). All of these investigations
would benefit from being able to draw from a larger
and/or more diverse sample of lenses.

In the last decade the number of these rare cosmic alignments
known has increased by an order of magnitude, thanks
to searches carried out in wide field surveys, such as CLASS
(e.g. Browne et al. 2003), SDSS (e.g. Bolton et al. 2006;
Auger et al. 2010a; Treu et al. 2011; Inada et al. 2012;
Hennawi et al. 2008; Belokurov et al. 2009; Diehl et al. 2009;
Furlanetto et al. 2013), CFHTLS (e.g. More et al. 2012) and
SPT (e.g. Vieira et al. 2013), among others. As the number
of known lenses has increased, new types have been discovered,
leading to entirely new investigations. Compound
lenses (Gavazzi et al. 2008; Collett et al. 2012) and lensed
supernovae (Quimby et al. 2014; Kelly et al. 2015) are good
examples of this.

Strong lenses are expensive to find, because they are
rare. The highest purity searches to date have made use of
relatively clean signals such as the presence of emission or
absorption features at two distinct redshifts in the same optical
spectrum (e.g. Bolton et al. 2004), or the strong “magnification
bias” towards detecting strongly-lensed sources in
the sub-mm/mm waveband (e.g. Negrello et al. 2010). Such
searches have to yield pure samples, because they require
expensive high resolution imaging follow-up; consequently
they have so far produced yields of only tens to hundreds of
lenses. An alternative approach is to search images already
of sufficiently high resolution and colour contrast, and confirm
the systems as gravitational lenses by modeling the survey
data themselves (Marshall et al. 2009). Several square
degrees of HST images have been searched, yielding several
tens of galaxy-scale lenses (e.g. Moustakas et al. 2007;
Jackson 2008; More et al. 2011; Pawase et al. 2014). Similarly,
searches of over a hundred square degrees of CFHT
Legacy Survey (CFHTLS) ground-based imaging, also with
sub-arcsecond image quality, have revealed a smaller number
of wider image separation group-scale systems (e.g. Cabanac
et al. 2007; More et al. 2012). Detecting galaxy-scale lenses
from the ground is difficult but feasible, albeit with lower
efficiency and requiring HST or spectroscopic follow-up to
confirm the candidates as lenses (e.g. Gavazzi et al. 2014).

How can we scale these lens searches up to imaging
surveys that cover a hundred times the sky area, such as
the almost-all sky surveys planned with the Large Synoptic
Survey Telescope (LSST) and Euclid? There are two approaches
to detecting lenses in imaging surveys. The first
one is robotic: automated analysis of the object catalogs
and the survey images. The candidate samples produced
by these methods have, to date, only reached purities of
1-10%, with visual inspection by teams of humans still required
to reduce the robotically-generated samples by factors
of 10-100 (see e.g. Marshall et al. 2009; More et al. 2012;
Gavazzi et al. 2014). In this approach, the image data may
or may not be explicitly modelled by the robots as if it contained
a gravitational lens, but the visual inspection can be
thought of as a “mental modeling” step. An inspector who
classifies an object as a lens candidate does so because the
features in the image that they see can be explained by a
model, contained in their brain, of what gravitational lenses
do. The second approach simply cuts out the robot middle-man:
Moustakas et al. (2007), Faure et al. (2008), Jackson
(2008) and Pawase et al. (2014) all performed successful
visual searches for lenses in HST imaging.

Until this problem is solved by machine learning tools,
visual image inspection seems unavoidable at
some level when searching for gravitational lenses. The technique
has some drawbacks, however. The first is that humans
make mistakes. A solution to this is for the inspectors to operate
in teams, providing multiple classifications of the same
images in order to catch errors and correct them. The second,
related drawback is that humans get tired. With a well-designed
classification interface, a human might be able to inspect
images at a rate of one astronomical object per second (provided
the majority are indeed uninteresting). At $10^4$ massive
galaxies, and 10 lenses, per square degree, visual lens
searches in good quality imaging data are limited to a few
square degrees per inspector per day (and less, if more time
is spent assessing difficult systems). Scaling to thousands
of square degrees therefore means either robotically reducing
the number of targets for inspection, or increasing the
number of inspectors, or both.

For example, a $10^4$ square degree survey containing $10^8$
photometrically-selected massive galaxies (and $10^5$ lenses)
could only be searched by 10 inspectors (at a mean rate of
1 galaxy per second and 3 inspections per galaxy) in about
5 years. Alternatively, an automated system could be asked
to produce a much purer sample: if it was able to reach
a purity of 10%, this would leave $10^6$ lens candidates (100
targets per square degree) to be visually inspected. At this
point the average visual classification time per object could
well be more like 10 seconds per object, requiring the same
team of 10 inspectors to work full time for 20 weeks (to
provide 3 classifications per lens between them). Neither of
these may be the most cost-effective or reliable strategy.
Alternatively, a team of 10® inspectors could, in principle, 
make the required 10® image classifications, 10® each, in a 
few weeks; robotically reducing the target list would lead to 
a proportional decrease in the required team size. 
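
As a rough check on the two small-team estimates above, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it takes sample sizes of 10^8 massive galaxies and 10^6 robot-selected candidates, and the working pattern (40-hour weeks, 46 working weeks per year) is our own assumption rather than a figure from the text.

```python
SECONDS_PER_HOUR = 3600.0

# Assumed working pattern (illustrative, not taken from the paper).
HOURS_PER_WEEK = 40.0
WEEKS_PER_YEAR = 46.0


def inspection_hours(n_targets, seconds_per_target, views_per_target, n_inspectors):
    """Hours of inspection time that each inspector must contribute."""
    total_seconds = n_targets * seconds_per_target * views_per_target
    return total_seconds / n_inspectors / SECONDS_PER_HOUR


# Case 1: every massive galaxy inspected, 1 s each, 3 views, 10 inspectors.
hours_all = inspection_hours(1e8, 1.0, 3, 10)
years_all = hours_all / (HOURS_PER_WEEK * WEEKS_PER_YEAR)

# Case 2: robot-selected candidates only, 10 s each, 3 views, 10 inspectors.
hours_cut = inspection_hours(1e6, 10.0, 3, 10)
weeks_cut = hours_cut / HOURS_PER_WEEK

print(f"All galaxies:    {hours_all:.0f} h per inspector, about {years_all:.1f} working years")
print(f"Candidates only: {hours_cut:.0f} h per inspector, about {weeks_cut:.0f} working weeks")
```

Under these assumptions the first case comes out at roughly four and a half working years per inspector and the second at roughly 21 working weeks, in line with the "about 5 years" and "20 weeks" quoted above.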

Systematic detection of rare astronomical objects
by such “crowd-sourced” visual inspection has recently
been demonstrated by the online citizen science project
Planet Hunters (Schwamb et al. 2012). Planet Hunters
was designed to enable the discovery of transiting exoplanets
in data taken by the Kepler satellite. A community of over
200,000 inspectors from the general public found, after each
undergoing a small amount of training, over 40 new exoplanet
candidates. They achieved this by visually inspecting
the Kepler lightcurves that were presented in a custom
web-based classification interface (Wang et al. 2013). The older
Galaxy Zoo morphological classification project (Lintott
et al. 2008) has also enabled the discovery of rare objects, via
its flexible inspection interface and discussion forum (Lintott
et al. 2009). Indeed, several of us (AV, CC, CM, EB, PM, LW)
were active in an informal Galaxy Zoo gravitational lens
search (Verma et al., in preparation), an experience which
led to the present hypothesis that a systematic online visual
lens search could be successful.

In this paper, we describe the Space Warps project,
a web-based system conceived to address the visual inspection
problem in gravitational lens detection for future large
surveys by “crowd-sourcing” it to a community of citizen
scientists. Implemented as a Zooniverse (Simpson et al. 2014)
project, it is designed to provide a gravitational lens discovery
service to survey teams looking for lenses in wide
field imaging data. In a companion paper (More et al. 2015,
hereafter Paper II) we will present the new gravitational
lenses discovered in our first lens search, and begin to investigate
the differences between lens detections made in
Space Warps and those made with automated techniques.
Here though, we simply try to answer the following questions:


• How reliably can we find gravitational lenses using the
Space Warps system? What is the completeness of the sample
produced likely to be?

• How noisy is the system? What is the purity of the
sample produced?

• How quickly can lenses be detected, and non-lenses be
rejected? How many classifications, and so how many volunteers,
are needed per target?

• What can we learn about the scalability of the
crowd-sourcing approach?

Our basic method in this paper is to analyze the performance
of the Space Warps system on the “training set”
of simulated lenses and known non-lenses. This allows us to
estimate completeness and purity with respect to gravitational
lenses that have the same properties as the training
set. In Paper II, we carry out a complementary analysis using
a sample of “known” (reported in the literature) lenses.

This paper is organised as follows. In Section 2 we introduce
the Space Warps classification interface and the
volunteers who make up the Space Warps collaboration,
explain how we use the training images, and describe our
two-stage candidate selection strategy. We then briefly introduce,
in Section 3, the particular dataset used in our
first test of the Space Warps system, and how we prepared
the images prior to displaying them in the web interface. In
Section 4 we describe our methodology for interpreting the
classifications made by the volunteers, and then present the
results of system performance tests made on the training images
in Section 5. We discuss the implications of our results
for future lens searches in Section 6 and draw conclusions in
Section 7.


2 EXPERIMENT DESIGN

The basic steps of a visual search for gravitational lenses
are: 1) prepare the images, 2) display them to an inspector,
3) record the inspector’s classification of each image (as, for
example, containing a lens candidate or not) and 4) analyze
that classification along with all others to produce a
final candidate list. We describe step 1 in Section 3 and
step 4 in Section 4. In this section we take a volunteer’s eye
view, and begin by describing the Space Warps classification
interface, the crowd of volunteers, and the interactions
between the two.


2.1 The Classification Interface

A screenshot of the Space Warps classification interface
(CI) is shown in Figure 1. The CI is the centrepiece of the
Space Warps website, http://spacewarps.org; the web
application is written in coffeescript, css and html and follows
the general design of others written by the Zooniverse
team^1. The focus of the CI is a large display of a
pre-prepared PNG image of the “subject” being inspected.
When the image is clicked on by the volunteer, a marker
symbol appears at that position. Multiple markers can be
placed, and they can be moved or removed by the classifier
if they change their mind.

The next image moves rapidly in from a queue formed
at the right hand side of the screen when the “Next” or
“Finished marking” button is pressed (if a marker is placed
somewhere in the image, the “Next” button changes to a
“Finished marking” button). Gravitational lenses are rare:
typically, most of the images will not contain a lens candidate,
and these need to be quickly rejected by the inspector.
The queue allows several images to be pre-loaded while the
volunteer is classifying the current subject, and the rapid
movement is deliberately designed to encourage volunteers
to classify quickly. There is no “back” button in the CI: each
volunteer may only classify a given subject once and subjects
cannot be returned to after the “Next” or “Finished marking”
button has been pressed (this can be a source of missed lens
candidates when classifying at speed). Note that by “classify”
we mean that we interpret no markers being placed
as a rejection, and the placing of at least one marker to mean
that a possible lensing event has been identified. After each
classification, the positions of any markers are written out to the
classification database, in an entry that also stores metadata
including the ID of the subject, the username of the
volunteer (or IP address if they are not logged in), and a
timestamp.

For the more interesting subjects, the CI offers two features
that enable their further investigation. The first is the
“Quick Dashboard” (QD), a more advanced image viewer.
This allows the viewer to compare three different contrast
and colour balance settings, to help bring out subtle features,
and to pan and zoom in on interesting regions of the
image to assess small features. Markers can be placed in the
Quick Dashboard just the same as in the main CI image
viewer. The second is a link to that subject’s page in the
project discussion forum (known as “Talk”^2). Here, volunteers
can discuss the features they have seen either before
they submit their classification, or after, if they “favourite”
the subject. The option of reading other opinions on any
given image or lens candidate before submitting a classification
means that the classifications may not be strictly
independent; however, the advantage of this system is that
volunteers can learn from others what constitutes a good
lens candidate (we are not able to track who visited Talk
before pressing “Finished marking”). This is not, however,
the primary educational resource; we describe the explicit
training that we provide for the volunteers in the next section.
The FITS images of the subject can be further explored
with Zooniverse “Dashboard Tools,” which include a more
powerful image viewer that enables dynamic variation of the
colour balance and contrast (stretch), and for any given view
to be shared via unique URL back to Talk. The FITS image
viewing in both the Quick Dashboard and the Dashboard
Tools is enabled by the javascript library fitsjs (Kapadia
2015).

Figure 1. Screenshot of the Space Warps classification interface at http://spacewarps.org.

^1 The Space Warps web application code is open source and can
be accessed from https://github.com/zooniverse/Lens-Zoo.
The project was renamed during development to avoid the question
of who “Len” is.

^2 http://talk.spacewarps.org
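
To make the content of a classification database entry concrete, the record described in this section (marker positions plus subject ID, volunteer identifier and timestamp) might be sketched as follows. This is a minimal illustration only: the class and field names are hypothetical and are not taken from the Space Warps web application code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Tuple


@dataclass
class Classification:
    """One volunteer's classification of one subject (hypothetical schema)."""

    subject_id: str
    # Username of the volunteer, or their IP address if they are not logged in.
    classifier_id: str
    # Pixel positions (x, y) of any markers placed on the image.
    markers: List[Tuple[float, float]] = field(default_factory=list)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    @property
    def flagged_as_candidate(self) -> bool:
        # No markers is interpreted as a rejection; one or more markers
        # means a possible lens has been identified.
        return len(self.markers) > 0


marked = Classification("subject-001", "volunteer_a", markers=[(120.5, 88.0)])
rejected = Classification("subject-002", "192.0.2.1")
print(marked.flagged_as_candidate, rejected.flagged_as_candidate)
```

The only interpretation applied at this level is the one stated in the text: an entry with no markers counts as a rejection, and any entry with at least one marker counts as a possible detection.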


2.2 Training 

Gravitational lenses are typically unfamiliar objects to the 
general public. New volunteers need to learn what lenses 
look like as quickly as possible, so that they can contribute 
informative classifications. They also need to learn what 
lenses do not look like, in order to reduce the false positive 
detection rate. There are three primary mechanisms in the 
Space Warps system for teaching the volunteers what to look 
for. These are, in the order in which they are encountered, 
an inline tutorial, instant feedback on “training images” inserted
into the stream of images presented to them, and
a “Spotter’s Guide.” As well as this, we provide “About”,
“FAQ” and “Science” pages explaining the physics of gravitational
lensing. While we expect that the insight from this
static material should help volunteers make sense of the features
in the images, we focus on the more dynamic,
activity-based training early, when engaging new volunteers to
participate takes priority.


2.2.1 Inline Tutorial 

New volunteers are welcomed to the site with a very short
tutorial, in which the task is introduced, a typical image containing
a simulated lens is displayed, and the marking procedure
walked through, using pop-up message boxes. Subsequent
images gradually introduce the more advanced features
of the classification interface (the QD and Talk buttons),
also using pop-up messages. The tutorial was purposely
kept as short as possible so as to provide the minimal
barrier to entry.


2.2.2 Training Subjects and Instant Feedback 

The second image viewed after the initial tutorial image is already
a survey image, in order to get the volunteers engaged
in the real task as quickly as possible. However, training then
continues, beyond the first image tutorial, through “training
subjects” inserted randomly into the stream. These training
subjects are either simulated lenses (known as “sims”),
or survey images that were expert-classified and found not
to contain any lens candidates (these images are known as
“duds”). The tutorial explains that the volunteers will be
shown such training images. They are also informed that
they will receive instant feedback about their performance
after classifying (blindly) any of these training subjects. Indeed,
after a volunteer finishes marking a training subject, a
pop-up message is generated, containing either congratulations
for a successful classification (for example, “Well done!
You spotted a simulated lens,” as in Figure 1) or feedback
for an unsuccessful one (for example, “There is no gravitational
lens in this field!”). A successful classification of a
simulated lens requires a marker to be placed fairly precisely
on the lensed features, so as to avoid misinforming the classifier.
The pixels in the lensed features are flagged via a mask
contained in the PNG image transparency “alpha” channel.
(The mask is not visible in the image.)
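
The feedback logic just described can be illustrated with a short Python sketch. It is not the Space Warps implementation: the mask is represented here as a plain list of rows of 0/1 flags rather than a PNG alpha channel, a marker is counted as successful only if it falls exactly on a flagged pixel (the real tolerance is not specified in the text), and two of the four messages are hypothetical placeholders.

```python
import math

# Example messages quoted in the text.
SIM_HIT_MESSAGE = "Well done! You spotted a simulated lens."
DUD_MISS_MESSAGE = "There is no gravitational lens in this field!"
# Hypothetical placeholder messages (wording not given in the text).
SIM_MISS_MESSAGE = "(placeholder) This field contained a simulated lens."
DUD_HIT_MESSAGE = "(placeholder) Correct: there is no lens in this field."


def marker_hits_mask(mask, x, y):
    """True if the pixel containing (x, y) is flagged in the mask."""
    row, col = math.floor(y), math.floor(x)
    if 0 <= row < len(mask) and 0 <= col < len(mask[row]):
        return bool(mask[row][col])
    return False


def training_feedback(kind, markers, mask=None):
    """Return (success, message) for a classified training subject."""
    if kind == "sim":
        if mask is None:
            raise ValueError("a sim needs a mask of its lensed features")
        success = any(marker_hits_mask(mask, x, y) for x, y in markers)
        return success, SIM_HIT_MESSAGE if success else SIM_MISS_MESSAGE
    if kind == "dud":
        success = len(markers) == 0
        return success, DUD_HIT_MESSAGE if success else DUD_MISS_MESSAGE
    raise ValueError("kind must be 'sim' or 'dud'")


# Toy 3x4 mask with a single flagged pixel at row 1, column 2.
toy_mask = [[0, 0, 0, 0],
            [0, 0, 1, 0],
            [0, 0, 0, 0]]
print(training_feedback("sim", [(2.4, 1.6)], toy_mask))
print(training_feedback("dud", [(1.0, 1.0)]))
```

Requiring a marker on the flagged pixels, rather than anywhere in the image, is what prevents a volunteer from being congratulated for marking an unrelated feature in a sim.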

Space Warps can be viewed as a supervised learning system,
because we include a training set of images in amongst
the survey “test” images. Consequently, we should expect
to find lens candidates that look like the sims that we put
in the training set, and so the design of this training set
is quite important. The training images must look realistic,
in two ways. First, apart from the simulated lenses they
contain, they must look like all the other survey images -
otherwise they could be identified as training images without
the volunteer learning anything about what lenses look
like. For this reason, we make the sims by adding in simulated
lensed features to real images drawn from the survey,
and we make the duds by simply inspecting and choosing
from a small sample of randomly-selected survey images. It
also means that each dataset presented for inspection in the
Space Warps interface needs to come with its own training
set.


Second, the lensed features themselves must be realistic; 
if they did not resemble real lenses, we could not expect to 
find real lenses in the test images. The details of how we 
generated the sims for the CFHTLS project are given in 
Paper II, where we also carry out performance tests relative 
to real lens candidates in the survey fields. Here, we merely 
note the following aspects to place this paper’s results in 
context. 

We began by selecting massive galaxies and groups in
the CFHTLS catalogs, by colour, brightness and photometric
redshift. We then assigned plausible mass distributions
to these “potential lenses”: a singular isothermal ellipsoid
model for each massive galaxy, and an additional NFW profile
dark matter halo for groups and clusters. The mass distributions
were centred, elongated and aligned with the measured
optical properties of the galaxies present. We then
drew a source from a plausible background population, of
either galaxies (for the group or galaxy lenses) or quasars
(for the galaxy-scale lenses). The source redshift, luminosity
and size distributions were all chosen to match those
observed, while the source abundance was artificially increased
such that each potential lens had a high probability
of having a source occur at a position that will lead to it
being highly magnified. Source galaxies were given plausible
surface brightness profiles and ellipticities. We then computed
the predicted multiple images of the source using the
GRAVLENS ray-tracing code (Keeton et al. 2000).

At this point, we applied several cuts to the simulated
lens pool, in order to reject sims that were likely to be difficult
to spot. While this ensured that the sims had high
educational value, it means that we should not necessarily
expect to be able to detect real lens candidates that would
not pass these cuts. However, human classifiers do possess
the key advantage of being able to imagine lens systems beyond
what they were shown during their training. Indeed,
any real lens candidates we find with properties outside the
ranges of the training set could be of particular interest, if
the training set is indeed completely representative of lenses
that we already know about. Geach et al. (2015), for example,
report the discovery of a red lensed arc by volunteers
who had been trained primarily on blue/green examples.
We investigate this aspect of the Space Warps approach
further in Paper II. In the present project, the most
important selection cuts we made were to only keep sims
with a) second brightest image having $i < 23$ and b) total
combined image magnitude fainter than 19-20 (Paper II,
Table 1). The combination of these two reductions resulted
in a sample of visible lensed image configurations.

Volunteers were initially shown training images at a 
mean frequency of two in five. Subjects were drawn at ran¬ 
dom from a pool consisting of (at first) 20% sims, 20% duds, 
and 60% test images, and such that no volunteer ever saw a 
given image more than once. As the number of classifications 
made by the volunteer increased, the training frequency was 
decreased from 40% to 2/(5 x 2d"t(iVc/20)+i)/2^^ _ 
the second 20 subjects, 20% for the third 20 subjects, and 
the minimum rate of 10% after that. We did not reduce the 
training frequency to zero because we wanted to ensure that 
the inspectors remained alert. 

This training regime meant that in the first 60 images
viewed, each volunteer was shown (on average) 9 simulated
gravitational lenses (as well as 9 pre-selected empty “dud”
fields). This is a much higher rate than the natural one:
to try and avoid this leading to over-optimism among the
inspectors (and a resulting high false positive rate), we displayed
the current “Simulation Frequency” on the classification
interface (“1 in 5” in Figure 1) and maintained the
consistent theme in the feedback messages that lenses are
rare.
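
The training schedule can be checked with a short Python sketch. It uses the stepwise rates quoted in the text (40% for the first 20 subjects, 20% for the third 20, 10% thereafter) and assumes a 30% rate for the second 20 subjects, which is the value implied by the stated average of 9 sims and 9 duds in the first 60 images. It also assumes that sims and duds remain equally likely at every stage. The function names are our own, and the rule that no volunteer sees the same image twice is not modelled.

```python
import random


def training_rate(n_classified):
    """Probability that the next subject is a training subject."""
    if n_classified < 20:
        return 0.40
    if n_classified < 40:
        return 0.30  # assumed value, inferred from the 9 + 9 average
    if n_classified < 60:
        return 0.20
    return 0.10  # minimum rate, kept non-zero so inspectors stay alert


def expected_training_subjects(n_images):
    """Expected number of training subjects among the first n_images."""
    return sum(training_rate(n) for n in range(n_images))


def draw_subject_kind(n_classified, rng):
    """Draw 'sim', 'dud' or 'test', assuming sims and duds are equally likely."""
    rate = training_rate(n_classified)
    u = rng.random()
    if u < rate / 2:
        return "sim"
    if u < rate:
        return "dud"
    return "test"


expected_60 = expected_training_subjects(60)
print(f"Expected training subjects in first 60 images: {expected_60:.1f}")
print(f"Expected sims (half of these): {expected_60 / 2:.1f}")

rng = random.Random(1)
stream = [draw_subject_kind(n, rng) for n in range(60)]
```

With these rates the expected number of training subjects in the first 60 images is 8 + 6 + 4 = 18, that is 9 sims and 9 duds on average, matching the figures quoted above.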

In Figures 2 and 3 we show example training images
from this first Space Warps project (Section 3 below).


2.2.3 Spotter’s Guide 

The instant feedback provides real-time educational responses
to the volunteers as they start classifying; as well
as this dynamic system, Space Warps provides a static reference
work for volunteers to consult when in doubt about
how to perform the task. This “Spotter’s Guide” is a set of
webpages showing example lenses, both real and simulated,
and also some common false positives, drawn from the pool
of survey images. The false positives were identified during
the selection of the “dud” training images (previous section).
For easy reference, the lenses are divided by type (for example,
“Lensed Galaxies,” “Lensed Quasars” and “Cluster
Lenses”), as are the false positives (for example, “Rings and
Spirals,” “Mergers,” “Artifacts” and so on). The example
images are accompanied by explanatory text. The Spotter’s
Guide is reached via a button on the left hand side, or the
hyperlinked thumbnail images of the “Quick Reference” provided
on the right hand side, of the classification interface.

Most of the text of the Spotter’s Guide focuses on the
kinds of features that gravitational lenses do or don’t produce.
To complement this, the website “Science” section contains
a very brief introduction to how gravitational lensing
works, which is fleshed out a little on the “FAQ” page. The
FAQ also contains answers to questions about the interface
and the task set.


2.3 Staged Classification

We now describe briefly the two-stage strategy that we employed
in this first project: initial classification (involving
the rejection of very large numbers of non-lenses), and refinement
(to further narrow down the sample). The web application
was reconfigured between the two stages, to assist
in their functioning.


2.3.1 Stage 1: Initial Classification

The goal of the Stage 1 classification was to achieve a high
rejection rate, while maintaining high completeness. In this
mode, therefore, the pre-loading of images was used to make
the sliding in of new subjects happen quickly, to provide a
sense of urgency: initial classification must be done fairly
quickly for the search to be completed within a reasonable
time period. We expect some trade-off between speed and
accuracy: we return to this topic in the results section below.
Completion of the search requires subjects to be “retired”
over time, as a result of their being classified. We did this
by analyzing the classifications on a daily basis, as described
in Section 4 below. As subjects were retired, new ones were
ingested into the web app for classification. This means that
the discovery of lens candidates in Stage 1 is truly a community
effort: to detect a lens candidate, many non-lenses
must first be rejected, and several classifications by different
inspectors are needed in either case.

The Stage 1 training set was chosen to be quite clear
cut, in order to err on the side of high completeness. When
defining the training duds, we discarded anything that could
be considered a lens candidate (see Section 2.2.3). This
meant that objects that look similar to lenses, such as galaxy
mergers, tidal tails and spiral arms, pairs of blue stars and
so on, were specifically excluded from the training set, and
therefore we expect some of those types of object to appear
in the Stage 1 candidate list. As described above, the
training sims for Stage 1 were also selected to be relatively
straightforward to spot.


2.3.2 Stage 2: Refinement

The design of a Stage 1 classification task, and its training
set, should lead to a sample of lens candidates that has high
completeness but may have low purity. To refine this sample
to higher purity, we need to reject more non-lenses, which
means providing the volunteers with a more realistic and
challenging training set as they re-classify it. The more demanding
Stage 2 training set was generated as follows. The
Stage 2 duds were selected from a small random subset of
the Stage 1 candidates (i.e., the Stage 2 duds were
expert-defined Stage 1 false positive detections), while the Stage 2
sims were chosen to be a subset of the Stage 1 sims, none
of which were deemed “obvious” by the same expert classifiers.
This meant that the Stage 2 sims had fainter and less
well-separated image features than in the Stage 1 training
set. Figures 2 and 3 show some example images from the
resulting Stage 2 training set.

We also attempted to encourage discernment by changing
the look and feel of the app, slowing down the arrival of
new images, and switching the background colour to bright
orange to make it clear that a different task was being set.
The frequency with which training images were shown was
fixed at 1 in 3. Finally, the Spotter’s Guide was upgraded
to include more examples of various possible false positives,
divided into sub-classes. We did not retire any subjects during
Stage 2 classification, instead continuing to accumulate
classifications through till the end of a fixed 4-week time
period.


3 DATA

We refer the reader to Paper II for the details of the particular
set of imaging survey data used in this first Space Warps
project. Here, we summarize very briefly the choices that
were made, in order to provide the context for our general
description and illustrations of the Space Warps system.


3.1 The CFHT Legacy Survey 

The four CFHT Legacy Survey^3 (CFHTLS, Gwyn 2012)
“Wide” fields cover a total of approximately 160 square degrees
of sky (after taking into account tile overlaps). With
high and homogeneous image quality (the mean seeing in the
g-band is 0.78 arcsec), and reaching limiting magnitudes of around
25 across the ugriz filter set, this survey has yielded several
dozen new gravitational lenses on both galaxy and group
scales (Gavazzi et al. 2014; Sonnenfeld et al. 2013; Cabanac
et al. 2007; More et al. 2012). The quality of the data, combined
with the presence of these comparison “known lens”
samples, makes this a natural choice against which to develop
and test the Space Warps system. The CFHTLS is
also representative of the data quality expected from
several next-generation sky surveys, such as DES, KiDS,
HSC and LSST. We use the stacked images from the final
T0007 release taken from the Terapix website^4 for this work.

In order to investigate the completeness of the previous
semi-automated lens searches in the CFHTLS area, we designed
a “blind search” as follows. We divided the CFHTLS
pointings into some 430,000 equal size, overlapping tiles,
approximately 82 arcsec on a side. We refer to these images as
“test images.” The “training images” were derived from a
small subset of these, as discussed above. In future, larger
area, projects we expect to implement a somewhat different
strategy of producing image tiles centred on particular
pre-selected “targets,” which might make for a more efficient (if
less complete) survey. However, we do not expect the performance
of citizen image inspectors to change significantly
between these strategies: to first order, both strategies require
the inspectors to learn what lenses look like, and then
search the presented images for similar features.


CFHTLS: http://www.cfht.hawaii.edu/Science/CFHTLS/
Terapix T0007 release: ftp://t07.terapix.fr/pub/T07/


3.2 Image Presentation

Figure 2. Typical Space Warps “sims” from the Stage 2 training set. The
top right-hand corner insets indicate the location of the simulated
lens in each of these training images. Volunteers needed to click on
these specific features in order to make a correct classification.

Figure 3. Typical Space Warps “duds” from the Stage 2 training set. All
of these subjects were correctly classified by the community as not
likely to contain gravitational lenses.

The CFHTLS g, r and i-band images have the greatest average depth and
highest average image quality of the survey, and we chose to focus on
this subset. (The u and z-band images were also made available for
perusal in Talk.) We made colour composite PNG format images following
the prescription of Lupton et al. (2004) (with extensions by Wherry et
al. 2004 and some particular choices of our own), using the HumVI
software. Specifically, we first rescaled the pixel values of each
image, in the notation of Lupton et al., into flux units
(“picomaggies”), via the image AB zeropoint $m_0$ and the pixel value
calibration factor $f$ given by $\log_{10} f = 0.4\,(30.0 - m_0)$. We
then multiplied these calibrated images by further aesthetic “scales”
$s_{i,r,g}$ before computing the total intensity image $I$ and applying
an arcsinh stretch. (The scales are re-normalized to sum to one on
input.) Thus, the red ($R$), green ($G$) and blue ($B$) channel images
correspond to the CFHTLS i, r and g-band images in the following way:


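$$
\begin{aligned}
I &= i \cdot s_i + r \cdot s_r + g \cdot s_g, \\
R &= i \cdot s_i \cdot \frac{\mathrm{asinh}(\alpha \cdot Q \cdot I)}{Q \cdot I}, \\
G &= r \cdot s_r \cdot \frac{\mathrm{asinh}(\alpha \cdot Q \cdot I)}{Q \cdot I}, \\
B &= g \cdot s_g \cdot \frac{\mathrm{asinh}(\alpha \cdot Q \cdot I)}{Q \cdot I}.
\end{aligned}
\tag{1}
$$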

We chose to allow the composite image formed from these channel images
to saturate to white: any pixel in any of the channel images lying
outside the range 0 to 1 was assigned the value 0 or 1 as appropriate.
This was not the recommendation of Lupton et al. (2004), but we found
that it still gave very informative, yet familiar-looking, astronomical
images.

The HumVI colour image composition code used in this work is open
source and available from http://github.com/drphilmarshall/HumVI
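
The composition steps described above can be sketched in a few lines of
NumPy. This is an illustrative sketch only, not the HumVI code itself:
the function names `calibration_factor` and `compose_rgb`, and the
treatment of pixels with exactly zero intensity (where the stretch
factor is replaced by its limiting value), are assumptions made here.
The default parameters are the “Standard” values listed in Table 1.

```python
import numpy as np


def calibration_factor(m0):
    """Pixel value calibration factor f, from log10(f) = 0.4 * (30.0 - m0)."""
    return 10.0 ** (0.4 * (30.0 - m0))


def compose_rgb(i_img, r_img, g_img, scales=(0.4, 0.6, 1.7), alpha=0.09, Q=1.0):
    """Combine calibrated i, r, g images into an (ny, nx, 3) RGB array in [0, 1].

    Hypothetical helper: inputs are 2-D arrays already multiplied by their
    calibration factors. Defaults are the "Standard" parameters of Table 1.
    """
    s_i, s_r, s_g = np.asarray(scales, dtype=float) / np.sum(scales)  # sum to one
    red = np.asarray(i_img, dtype=float) * s_i
    green = np.asarray(r_img, dtype=float) * s_r
    blue = np.asarray(g_img, dtype=float) * s_g
    intensity = red + green + blue  # total intensity image I

    # Stretch factor asinh(alpha * Q * I) / (Q * I); its limit as I -> 0 is alpha.
    x = Q * intensity
    nonzero = x != 0.0
    stretch = np.full_like(x, alpha)
    stretch[nonzero] = np.arcsinh(alpha * x[nonzero]) / x[nonzero]

    rgb = np.dstack([red * stretch, green * stretch, blue * stretch])
    return np.clip(rgb, 0.0, 1.0)  # values outside [0, 1] are set to 0 or 1
```

Faint pixels are scaled almost linearly (by a factor close to alpha),
while very bright pixels are compressed by the arcsinh and finally
clipped, which is what lets them saturate to white.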


The non-linearity parameters $Q$ and $\alpha$ control the brightness
and contrast of the images. We first tuned $\alpha$ (which acts as an
additional scale factor) until the background noise was just visible.
Then we tuned the colour scales $s$ to find a balance between exposing
the low surface brightness blue features (important for lens
spotting!), and having the noise appear to have equal red, green and
blue components. Non-linearity sets in at about $1/(Q\alpha)$ in the
scaled intensity image. Finally, tuning $Q$ at fixed $\alpha$
determines the appearance of bright galaxies, which we need to be
suppressed enough to allow the low surface brightness features to show
through, but not so much that they no longer look like massive
galaxies.


These parameters were then fixed during the production of all the
tiles, in order to allow straightforward comparison between one image
and another, and for intuition to be built up about the appearance of
stars and galaxies across the survey. (Alternative algorithms, such as
adjusting the stretch and scale dynamically according to, for example,
the root-mean-square pixel value in each image, can lead to better
presentation of bright objects, but in doing so they tend to hide the
faint features in those images: we needed to optimize the detectability
of these faint features.) Examples of CFHTLS training set images
prepared in this way can be seen in Figures 2 and 3. We also defined
two alternative sets of visualization parameters, to display a “bluer”
image and a “brighter” image in the classification interface Quick
Dashboard (Section 2.1). The QD code performs the same image
composition as just described, but dynamically on FITS images in the
browser. These FITS images are also available for viewing in Talk, via
the main Zooniverse Dashboard image display tool, which again offers
the same stretch settings as a starting point for image exploration.
Our parameter choices are given in Table 1.

Table 1. Example HumVI image display parameters, for the CFHTLS images.

Image         Scales s_i, s_r, s_g    α       Q
“Standard”    0.4, 0.6, 1.7           0.09    1.0
“Brighter”    0.4, 0.6, 1.7           0.17    1.0
“Bluer”       0.4, 0.6, 2.5           0.11    2.0


4 CLASSIFICATION ANALYSIS 

Having described the classification interface, the training images and
the test images, we now outline our methodology for interpreting the
classifications made by the volunteers, and describe how we applied
this methodology in the two classification stages of the CFHTLS project
in the Space Warps Analysis Pipeline (SWAP) code.

The open source SWAP code is available from
https://github.com/drphilmarshall/SpaceWarps

Each classification made is logged in a database that stores the
subject ID, volunteer ID, a timestamp and the results of each
classification: the image pixel-coordinate positions of every marker
placed. The “category” of subject, whether it is a “training” subject
(a simulated lens or a known non-lens) or a “test” subject (an unseen
image drawn from the survey), is also recorded. For training subjects,
we also store the “kind” of the subject as a lens (“sim”) or a non-lens
(“dud”), and also the “flavour” of lens object if one is present in the
image (“lensed galaxy”, “lensed quasar” or “cluster lens”). This
information is used to provide the instant feedback, but is also the
basic data used in a probabilistic classification of every subject
based on all image views to date. Not all volunteers register with the
Zooniverse (although all are prompted to do so); in these cases we
record their IP addresses as substitute IDs.

While the Space Warps web app was live, and classifications were being
made, we performed a daily online analysis of the classifications,
updating a probabilistic model of every anonymous volunteer’s data, and
also updating the posterior probability that each subject (in both the
training and test sets) contains a lens. This gave us a dynamic
estimate of the posterior probability for any given subject being a
lens, given all classifications of it to date. Assigning thresholds in
this lens probability then allowed us to make good decisions about
whether or not to retire a subject from the system, in order to focus
attention on new images.

The details of how the lens probabilities are calculated are given
below. In summary:

• Each volunteer is assigned a simple software agent, characterised by
a confusion matrix M. The two independent elements of this matrix are
the probabilities, as estimated by the agent, that the volunteer is
going to be 1) correct when they report that an image contains a lens
when it really does contain a lens, $M_{LL}$ = Pr(“LENS”|LENS, T), and
2) correct when they report that an image does not contain a lens when
it really doesn’t contain a lens, $M_{NN}$ = Pr(“NOT”|NOT, T). The
confusion matrix contains all the information the agent has about how
good its volunteer is at classifying images.

• Each agent updates its confusion matrix elements based on the number
of times its volunteer has been right in each way (about both the sims
and the duds) while classifying subjects from the training set,
accounting for noise early on due to small number statistics: T is the
set of all training images seen to date.

• Each agent uses its confusion matrix to update, via Bayes’ theorem,
the probability of an image from the test set containing a lens,
Pr(LENS|C, T), when that image is classified by its volunteer. (C is
the set of all classifications made of this subject.)

For a detailed derivation of this analysis pipeline, please continue
reading through Section 4.1 below. Alternatively, Section 4.5 contains
illustrations of the calculation.

4.1 SWAP: the Space Warps Analysis Pipeline 

Our aim is to enable the construction of a sample of good lens
candidates. Since we aspire to making logical decisions, we define a
“good candidate” as one which has a high posterior probability of being
a lens, given the data: Pr(LENS|d). Our problem is to approximate this
probability. The data d in our case are the pixel values of a colour
image. However, we can greatly compress these complex, noisy sets of
data by asking each volunteer what they think about them. A complete
classification in Space Warps consists of a set of Marker positions, or
none at all. The null set encodes the statement from the volunteer that
the image in question is “NOT” a lens, while the placement of any
Markers indicates that the volunteer considers this image to contain a
“LENS”. We simplify the problem by only using the Marker positions to
assess whether the volunteer correctly assigned the classification
“LENS” or “NOT” after viewing (blindly) a member of the training set of
subjects.

How should we model these compressed data? The circumstances of each
classification are quite complex, as are the human classifiers
themselves: the volunteers learn more about the problem as they go, but
also inevitably make occasional mistakes (especially when classifying
at high speed). To cope with this uncertainty, we assign a simple
software agent to partner each volunteer. The agent’s task is to
interpret their volunteer’s classification data as best it can, using a
model that makes a number of necessary approximations. These
interpretations will then include uncertainty arising as a result of
the volunteer’s efforts and also the agent’s approximations. The agent
will be able to predict, using its model, the probability of a test
subject being a LENS or an empty field given both its volunteer’s
classification and its volunteer’s past experience. In this section we
describe how these agents work, and other aspects of the Space Warps
analysis pipeline (SWAP).

4.2 Agents and their Confusion Matrices

Each agent assumes that the probability of a volunteer recognising any
given simulated lens as a lens is some number, Pr(“LENS”|LENS,
$d^{\rm t}$), that depends only on what the volunteer is currently
looking at, and all the previous training subjects they have seen (and
not on what type of lens it is, how faint it is, what time it is,
etc.). Likewise, it also assumes that the probability of a volunteer
recognising any given dud image as a dud is some other number,
Pr(“NOT”|NOT, $d^{\rm t}$), that also depends only on what the
volunteer is currently looking at, and all the previous training
subjects they have seen. These two probabilities define a 2 by 2
“confusion matrix,” which the agent updates, every time a volunteer
classifies a training subject, using the following very simple
estimate:

$$ \Pr(\text{“X”}\,|\,X, d^{\rm t}) \approx \frac{N_{\text{“X”}}}{N_X}. \tag{2} $$

Here, X stands for the true classification of the subject, i.e. either
LENS or NOT, while “X” is the corresponding classification made by the
volunteer on viewing the subject. $N_X$ is the number of training
subjects of the relevant type the volunteer has been shown, while
$N_{\text{“X”}}$ is the number of times the volunteer got their
classifications of this type of training subject right. $d^{\rm t}$
stands for all $N_{\rm LENS} + N_{\rm NOT}$ training data that the
agent has heard about to date.
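
The bookkeeping behind this estimate can be sketched as a small Python
class. The class name `Agent` and its methods are hypothetical and are
not taken from the SWAP code. The optional add-one smoothing is an
assumption of this sketch: it reproduces the initial values of 0.5 and
the Laplace smoothing mentioned elsewhere in the text, but whether SWAP
applies it in exactly this form is not confirmed here.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical software agent tracking one volunteer's training record."""
    shown: dict = field(default_factory=lambda: {"LENS": 0, "NOT": 0})    # N_X
    correct: dict = field(default_factory=lambda: {"LENS": 0, "NOT": 0})  # N_"X"

    def heard(self, truth, said):
        """Record one training classification; truth and said are 'LENS' or 'NOT'."""
        self.shown[truth] += 1
        if said == truth:
            self.correct[truth] += 1

    def element(self, truth, smooth=True):
        """Estimate Pr("X" | X): M_LL for truth='LENS', M_NN for truth='NOT'."""
        n_right, n_shown = self.correct[truth], self.shown[truth]
        if smooth:
            # Add-one (Laplace) smoothing: an assumption of this sketch, which
            # gives 0.5 before any training subject has been seen.
            return (n_right + 1) / (n_shown + 2)
        return n_right / n_shown  # plain ratio of Equation 2; needs n_shown > 0
```

For a volunteer who has marked three of four sims, the plain ratio is
0.75 while the smoothed estimate is 4/6, which stays closer to 0.5
until more training subjects have been seen.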

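The full confusion matrix of the volunteer’s agent is therefore:

$$
M^k =
\begin{bmatrix}
\Pr(\text{“LENS”}|\mathrm{NOT}, d^{\rm t}_k) & \Pr(\text{“LENS”}|\mathrm{LENS}, d^{\rm t}_k) \\
\Pr(\text{“NOT”}|\mathrm{NOT}, d^{\rm t}_k) & \Pr(\text{“NOT”}|\mathrm{LENS}, d^{\rm t}_k)
\end{bmatrix}
=
\begin{bmatrix}
M_{LN} & M_{LL} \\
M_{NN} & M_{NL}
\end{bmatrix}.
\tag{3}
$$

Note that these probabilities are normalized, such that
Pr(“NOT”|NOT) = 1 - Pr(“LENS”|NOT).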

Now, when this volunteer views a test subject, it is this confusion
matrix that will allow their software agent to update the probability
of that test subject being a LENS. Let us suppose that this subject has
never been seen before: the agent assigns a prior probability that it
is (or contains) a lens,

$$ \Pr(\mathrm{LENS}) = p_0, \tag{4} $$

where we have to assign a value for $p_0$. In the CFHTLS, we might
expect something like 100 lenses in 430,000 images, so
$p_0 = 2 \times 10^{-4}$ is a reasonable estimate. The volunteer then
makes a classification $C_k$ (= “LENS” or “NOT”). We can apply Bayes’
Theorem to derive how the agent should update this prior probability
into a posterior one using this new information:

$$
\Pr(\mathrm{LENS}|C_k, d^{\rm t}_k) =
\frac{\Pr(C_k|\mathrm{LENS}, d^{\rm t}_k) \cdot \Pr(\mathrm{LENS})}
{\left[ \Pr(C_k|\mathrm{LENS}, d^{\rm t}_k) \cdot \Pr(\mathrm{LENS})
 + \Pr(C_k|\mathrm{NOT}, d^{\rm t}_k) \cdot \Pr(\mathrm{NOT}) \right]},
\tag{5}
$$

which can be evaluated numerically using the elements of the confusion
matrix.
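
As a numerical illustration of this update, the following Python sketch
evaluates the posterior for a single classification. The function name
and the example matrix elements (0.9 and 0.8) are illustrative
assumptions, not values taken from the analysis; only the prior of
2 x 10^-4 comes from the text.

```python
def update_lens_probability(prior, said_lens, m_ll, m_nn):
    """Posterior Pr(LENS | C_k) after one classification, following Equation 5.

    m_ll = Pr("LENS"|LENS) and m_nn = Pr("NOT"|NOT) are the two independent
    confusion matrix elements of the volunteer's agent.
    """
    if said_lens:
        like_lens, like_not = m_ll, 1.0 - m_nn   # Pr("LENS"|LENS), Pr("LENS"|NOT)
    else:
        like_lens, like_not = 1.0 - m_ll, m_nn   # Pr("NOT"|LENS), Pr("NOT"|NOT)
    evidence = like_lens * prior + like_not * (1.0 - prior)
    return like_lens * prior / evidence


p0 = 2e-4
print(update_lens_probability(p0, True, m_ll=0.9, m_nn=0.8))   # roughly 9e-4
print(update_lens_probability(p0, False, m_ll=0.9, m_nn=0.8))  # roughly 2.5e-5
```

A single “LENS” classification from this fairly reliable (hypothetical)
volunteer raises the probability by a factor of about 4.5, while a
“NOT” lowers it by a factor of about 8; an agent with both elements
equal to 0.5 leaves the probability unchanged.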

For example, suppose we have a volunteer who is always right about the
true nature of a training subject. Their agent’s confusion matrix would
be

$$
M^{\rm perfect} =
\begin{bmatrix}
0.0 & 1.0 \\
1.0 & 0.0
\end{bmatrix}.
\tag{6}
$$

On being given a fresh subject that actually is a LENS, this
hypothetical volunteer would submit C = “LENS”. Their agent would then
calculate the posterior probability for the subject being a LENS to be

$$
\Pr(\mathrm{LENS}|\text{“LENS”}, d^{\rm t}_k) =
\frac{1.0 \cdot p_0}{\left[ 1.0 \cdot p_0 + 0.0 \cdot (1 - p_0) \right]} = 1.0,
\tag{7}
$$

as we might expect for such a perfect classifier. Meanwhile, a
hypothetical volunteer who (for some reason) wilfully always submits
the wrong classification would have an agent with the column-swapped
confusion matrix

$$
M^{\rm obtuse} =
\begin{bmatrix}
1.0 & 0.0 \\
0.0 & 1.0
\end{bmatrix},
\tag{8}
$$

and would submit C = “NOT” for this subject. However, such a volunteer
would nevertheless be submitting useful information, since given the
above confusion matrix, their software agent would calculate

$$
\Pr(\mathrm{LENS}|\text{“NOT”}, d^{\rm t}_k) =
\frac{1.0 \cdot p_0}{\left[ 1.0 \cdot p_0 + 0.0 \cdot (1 - p_0) \right]} = 1.0.
\tag{9}
$$

Obtuse classifiers turn out to be as helpful as perfect ones, because
the agents know to trust the opposite of their classifications.


4.3 Online SWAP: Updating the Subject Probabilities

Suppose the $(k+1)$th volunteer now submits a classification, on the
same subject just classified by the $k$th volunteer. We can generalise
Equation 5 by replacing the prior probability with the current
posterior probability:

$$
\begin{aligned}
\Pr(\mathrm{LENS}|C_{k+1}, d^{\rm t}_{k+1}, d) &=
\frac{1}{Z}\,\Pr(C_{k+1}|\mathrm{LENS}, d^{\rm t}_{k+1}) \cdot \Pr(\mathrm{LENS}|d), \\
\text{where}\quad Z &= \Pr(C_{k+1}|\mathrm{LENS}, d^{\rm t}_{k+1}) \cdot \Pr(\mathrm{LENS}|d)
 + \Pr(C_{k+1}|\mathrm{NOT}, d^{\rm t}_{k+1}) \cdot \Pr(\mathrm{NOT}|d),
\end{aligned}
\tag{10, 11}
$$

and $d = \{C_k, d^{\rm t}_k\}$ is the set of all previous
classifications, and the set of training subjects seen by each of those
volunteers. Pr(LENS|d) is the fundamental property of each test subject
that we are trying to infer. We track Pr(LENS|d) as a function of time,
as it is updated; by comparing it to a lower or upper threshold, we can
make decisions about whether to retire the subject from the
classification interface, or flag it for further study, respectively.
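
The sequential update and threshold test just described can be sketched
as a short loop. The function name `run_online`, the tuple layout of
the input, and the status strings are hypothetical choices made for
this sketch; the thresholds are passed in as arguments rather than
fixed.

```python
def run_online(prior, classifications, lower, upper):
    """Apply classifications in order until a threshold is crossed.

    classifications: iterable of (said_lens, m_ll, m_nn) tuples, one per
    volunteer, where m_ll and m_nn are that volunteer's agent's current
    confusion matrix elements.
    Returns (status, history), with status one of "retire", "detect", "active".
    """
    p = prior
    history = [p]
    for said_lens, m_ll, m_nn in classifications:
        like_lens = m_ll if said_lens else 1.0 - m_ll
        like_not = (1.0 - m_nn) if said_lens else m_nn
        z = like_lens * p + like_not * (1.0 - p)   # normalisation Z
        p = like_lens * p / z                      # posterior becomes the next prior
        history.append(p)
        if p < lower:
            return "retire", history
        if p > upper:
            return "detect", history
    return "active", history
```

With purely illustrative inputs (a prior of 2 x 10^-4, thresholds of
10^-7 and 0.95, and volunteers whose matrix elements are both 0.8), six
consecutive “NOT” classifications retire a subject, and nine
consecutive “LENS” classifications are needed to detect it.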

Finally, the confusion matrix obtained from the application of
Equation 2 has some inherent noise, which reduces as the number of
training subjects classified by the agent’s volunteer increases. For
simplicity, the discussion has thus far assumed the case when the
confusion matrix is known perfectly; in practice, we allow for
uncertainty in the agent confusion matrices by averaging over a small
number of samples drawn from Binomial distributions characterised by
the matrix elements Pr($C_k$|LENS, $d^{\rm t}_k$) and Pr($C_k$|NOT,
$d^{\rm t}_k$). The associated standard deviation in the estimated
subject probability provides an error bar for this quantity.
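
One possible reading of this averaging step is sketched below. The
resampling scheme is an assumption: here each matrix element is redrawn
by simulating the volunteer's training record with a Binomial
distribution and re-estimating the element with add-one smoothing. The
function name, the default number of samples and the smoothing are all
choices made for this sketch, not details confirmed by the text.

```python
import numpy as np


def posterior_with_error(prior, said_lens, m_ll, m_nn, n_lens, n_not,
                         n_samples=10, seed=0):
    """Mean and standard deviation of the updated lens probability.

    n_lens and n_not are the numbers of sims and duds the volunteer has
    classified; fewer training subjects give noisier matrix elements.
    """
    rng = np.random.default_rng(seed)
    # Redraw the number of correct training classifications, then re-estimate
    # each element (smoothing keeps the samples strictly between 0 and 1).
    ll = (rng.binomial(n_lens, m_ll, size=n_samples) + 1.0) / (n_lens + 2.0)
    nn = (rng.binomial(n_not, m_nn, size=n_samples) + 1.0) / (n_not + 2.0)

    like_lens = ll if said_lens else 1.0 - ll
    like_not = (1.0 - nn) if said_lens else nn
    posterior = like_lens * prior / (like_lens * prior + like_not * (1.0 - prior))
    return posterior.mean(), posterior.std()
```

Under this scheme the error bar shrinks as the volunteer sees more
training subjects, because the resampled matrix elements scatter less
about their estimated values.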

In a later section and in an appendix below, we define several
quantities based on the probabilities listed above that serve to
quantify the performance of the crowd in terms of the information they
provide via their classifications, and report on the performance of the
system in returning a sample of lens candidates as a function of
Pr(LENS|C, T) threshold.

4.4 Offline SWAP 

The probabilistic model described above does not need to be implemented
as an online inference. Indeed, it might be more appropriate to perform
the inference of all agent confusion matrix elements and subject
probabilities simultaneously. Such a joint analysis would implement in
full the software agents’ basic model assumption that their volunteers
have innate and unchanging talent for lens spotting, that is
parameterised by two constant confusion matrix elements which simply
need to be inferred given the data.

We refer to this alternative calculation as “offline” analysis, because
it does not need to be carried out one classification at a time (and
hence can be done at any time after the data is collected). Note that
in this offline inference, the effect will be that of applying the
time-averaged confusion matrices to each classification, rather than a
set that evolves as the agents (and in the real world, the volunteers)
learn. This will mean that the early classifications will effectively
not be downweighted as a result of the agent’s ignorance. On the other
hand, this ignorance provides some conservatism, reducing the noise due
to early classifications if they are unreliable: in practice we do
downweight (via Laplace/add-one smoothing) the early classifications
made by each volunteer, for this reason. Whether or not an offline
analysis out-performs an online one is a matter for experiment.

The mechanics of how we carry out the offline inference will be
presented elsewhere (Davis et al., in preparation). Here we briefly
note that the procedure is to maximize the joint posterior probability
distribution for all the model parameters (some 74,000 confusion matrix
elements and 430,000 subject probabilities) with a simple
expectation-maximisation algorithm. This takes approximately the same
CPU time as the Stage 2 online analysis, because no matrix inversions
are required in the algorithm. The algorithm scales well, and is
actually faster than the online analysis with the larger Stage 1
dataset. The expectation-maximisation algorithm is robust to initial
starting parameters in, e.g., initial agent confusion matrix elements
and subject probabilities. The difference in performance between the
online and offline analyses is presented in Section 5.1 below.
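
Since the actual offline procedure is deferred to a later paper, the
following is only a generic sketch of how an expectation-maximisation
scheme of this kind can be set up, in the style of the standard
Dawid-Skene model for combining noisy labels. The function name, the
data layout, the add-one smoothing in the M-step and the fixed prior
are all assumptions of this sketch, not the authors' implementation.

```python
import numpy as np


def offline_em(agent, subject, said_lens, n_agents, n_subjects, prior,
               truth=None, n_iter=50):
    """Jointly estimate agent confusion matrix elements and subject probabilities.

    agent, subject, said_lens: one entry per classification (agent index,
    subject index, True if the volunteer placed any markers).
    truth: optional dict {subject index: 1.0 for a sim, 0.0 for a dud}.
    """
    agent = np.asarray(agent)
    subject = np.asarray(subject)
    said = np.asarray(said_lens, dtype=float)
    truth = truth or {}
    known_idx = np.array(list(truth.keys()), dtype=int)
    known_val = np.array(list(truth.values()), dtype=float)

    p = np.full(n_subjects, float(prior))
    p[known_idx] = known_val
    prior_log_odds = np.log(prior / (1.0 - prior))

    for _ in range(n_iter):
        # M-step: expected fraction of correct calls per agent, with add-one
        # smoothing so that every element stays strictly between 0 and 1.
        w = p[subject]
        m_ll = ((np.bincount(agent, weights=w * said, minlength=n_agents) + 1.0)
                / (np.bincount(agent, weights=w, minlength=n_agents) + 2.0))
        m_nn = ((np.bincount(agent, weights=(1 - w) * (1 - said),
                             minlength=n_agents) + 1.0)
                / (np.bincount(agent, weights=1 - w, minlength=n_agents) + 2.0))

        # E-step: combine all classifications of each subject in log-odds.
        like_lens = np.where(said == 1.0, m_ll[agent], 1.0 - m_ll[agent])
        like_not = np.where(said == 1.0, 1.0 - m_nn[agent], m_nn[agent])
        log_odds = prior_log_odds + np.bincount(
            subject, weights=np.log(like_lens / like_not), minlength=n_subjects)
        p = 1.0 / (1.0 + np.exp(-log_odds))
        p[known_idx] = known_val  # training subjects keep their known labels

    return m_ll, m_nn, p
```

Each iteration involves only weighted counts and sums over the
classifications, with no matrix inversions, so the cost per iteration
grows linearly with the number of classifications.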

4.5 Application to the CFHTLS Classifications 

Figure 4 shows the confusion matrix elements of 200 randomly-selected
agents, as they were on a particular day during the Stage 1 online
analysis. Many volunteers classify only a small number of images, and
so their software agents’ confusion matrix elements have remained close
to their initial values of (0.5, 0.5). As more images are classified
(shown by the yellow point size), the agents’ matrix elements tend to
move towards higher values, partly as the volunteers attain greater
skill levels (green point sizes; see below) but mostly as the agent
updates its confusion matrix. In this upper right-hand quadrant, the
agents perceive their volunteers to be “astute.”

[Figure 4 image: snapshot at 2013-10-29 10:16:35; horizontal axis
Pr(“LENS”|LENS), from 0.0 to 1.0.]

Figure 4. Typical Space Warps agent confusion matrix elements. At a
particular snapshot, 200 randomly-selected agents are shown distributed
over the unit plane, with a tendency to move towards the “astute”
region in the upper right hand quadrant as each agent’s volunteer views
more images. Yellow point size is proportional to the number of images
classified; green point size shows agent-perceived “skill” (see the
Appendix).


During the Stage 1 classification of the CFHTLS images, we assigned a
prior probability for each image to contain a lens of
$2 \times 10^{-4}$, based on a rough estimate of the number of expected
lenses in the survey, and the fraction of the survey area covered by
each image. We then assigned two values of the images’ posterior
probability, Pr(LENS|C, T), to define “detection” and “rejection”
thresholds. These were set to be 0.95 and (approximately symmetrically
in the logarithm of probability) $10^{-7}$. Subjects that attained a
probability of less than the rejection threshold were scheduled for
retirement and subsequently ignored by the analysis code. Subjects
crossing the P = 0.95 detection threshold were also subsequently
ignored by the analysis code, but they were not retired from the
website, so that more volunteers could see them.

The progress of the subjects during the online analysis is illustrated
in Figure 5. Subjects appear on this plot at the tip of the arrow, at
zero classifications and prior probability; they then drift downwards
as they are classified by the crowd, with each agent applying the
appropriate “kick” in probability based on what it hears its volunteer
say. Encouragingly, sims (blue) tend to end up with high probability,
while duds (red) pile up at low probability; test subjects (black)
mostly drift to low probability, but some go the other way. The latter
will help make up the candidate sample. As this plot shows, around 10
classifications are typically required for a subject to reach the
retirement (or detection) threshold.

The online analysis code was run every night during the project, and
subjects were retired in batches after its completion. This introduced
some inefficiency, because some classifications were accumulated in the
time between a subject crossing the rejection threshold and it actually
being retired from the website. (We quantify this inefficiency in
Section 6.2 below.) As subjects were retired from the site, more
subjects were activated. In this way, the volunteers who down-voted
images for not containing any lensed features enabled new images to be
shown to other members of the community.

Figure 5. Typical Space Warps Stage 1 subject trajectories. Top: 200
randomly-selected subjects drift downwards as they are classified,
while being nudged left and right in probability by the agents as they
interpret the volunteers’ input. The dotted vertical lines show (left
to right) the retirement threshold, prior probability, and the
detection threshold. Different colours denote the different kinds of
subject. Bottom: histograms of all the subject probabilities computed
to date, sub-divided by subject kind. The blue bar (of
correctly-detected sims) hides a grey bar of around 3000 new lens
candidates, which are the subject of Paper II. The horizontal axis is
the posterior probability Pr(LENS|d).

When all the subjects had either been retired, or at least classified
around 10 times or more, the web app was paused and reconfigured for
Stage 2. The sample of subjects classified during Stage 2 was selected
to be all those that passed the detection threshold (Pr(LENS|C, T) >
0.95) at Stage 1. These were classified for one week, with no
retirement but a maximum classification number of 50 each. The number
of subjects at Stage 2 was small enough that we did not need to retire
any: instead, we simply collected classifications for a fixed period of
time (about 4 weeks). Without the time pressure motivating an
online-only calculation (as there had been during Stage 1), we carried
out an offline analysis (Section 4.4 above) of the Stage 2
classifications as well, both for comparison and, as we will see in the
next section, to improve the pipeline performance.


5 RESULTS 

In this section we present our findings about the performance of the
Space Warps system, in terms of the classification of the training set,
the information contributed by the crowd, and the speed at which the
image set was classified.

5.1 Sample Properties 

We first quantify the performance of the Space Warps system in terms of
the recovery of the training set images. At Stage 1, this set contained
around 5712 simulated lenses, and 450 duds; at Stage 2, we used 152
images of simulated lenses and 201 duds selected as Stage 1 false
positives (as described in the discussion of the training set).
Figure 6 shows receiver operating characteristic (ROC) curves for
CFHTLS Stage 1 and Stage 2. These plots show the true positive rate
(TPR, the number of sims correctly detected divided by the total number
of sims in the training set), and the false positive rate (FPR, the
number of duds incorrectly detected divided by the total number of duds
in the training set), both for a given sample of detections defined by
a particular probability threshold, which varies along the curves. In
both stages, these curves show that true positive rates of around 90%
were achieved, at very low false positive rates. The probability
threshold that was used for detection during the Stage 1 online
analysis was 0.95; this point is marked with a symbol on each curve.
This turned out to be close to optimal (although a better approach
would be to keep track of the ROC curve as the survey progressed!).
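
The TPR and FPR defined above are simple to compute from the training
set probabilities. The sketch below assumes the posterior probabilities
of the sims and duds are available as two arrays; the function names
and the toy probabilities in the usage note are hypothetical.

```python
import numpy as np


def roc_point(p_sims, p_duds, threshold):
    """True and false positive rates for detections with Pr(LENS) > threshold."""
    p_sims = np.asarray(p_sims, dtype=float)
    p_duds = np.asarray(p_duds, dtype=float)
    tpr = float(np.mean(p_sims > threshold))  # detected sims / all sims
    fpr = float(np.mean(p_duds > threshold))  # detected duds / all duds
    return tpr, fpr


def roc_curve(p_sims, p_duds, thresholds):
    """List of (TPR, FPR) points, one per probability threshold."""
    return [roc_point(p_sims, p_duds, t) for t in thresholds]
```

For example, with four toy sims at probabilities 0.99, 0.97, 0.5 and
1e-8, and five toy duds of which one sits at 0.96, a threshold of 0.95
gives a TPR of 0.5 and an FPR of 0.2.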

For comparison we show the results of an analysis where the
classifications of training images were ignored, and the agents’
confusion matrix elements were instead all simply assigned initial
values of 0.75, which then remained constant. This set-up emulates a
very simple unweighted voting scheme, where all classifications are
treated equally. In this case, the TPR never reaches 80% in Stage 1 or
60% in Stage 2, thus illustrating the benefit of including training
images and allowing the agents to learn via their Bayesian updates.
When the software agents are allowed to update their confusion
matrices, the choice of initial confusion matrix is not very important:
the same 0.75 initial values applied to normal, learning agents
resulted in only a slightly lower TPR than the default case.

The dot-dashed curves show the results from the offline analysis. At
Stage 1 the results are very similar to the online version that was
actually run (solid line). However, at Stage 2 there is marginal
evidence of there being greater benefit to doing the analysis offline
(Section 4.4). Over 85% TPR is achieved at zero FPR in the offline
analysis, while if one is willing to accept a false positive rate of
5%, the true positive rate rises to over 95%, showing that some of the
sims that were missed in the online analysis may be being recovered by
doing the analysis offline. (The same is true at Stage 1, but to a
lesser extent.) One interpretation for this result (if it holds up)
would be that the offline analysis, by using each agent’s entire
history, is making less noisy probability estimates. This could be
consistent with other citizen science/crowdsourcing projects where, in
the absence of high quality information about classifiers,
sophisticated strategies tend to under-perform naive but simpler ones
(Waterhouse 2013; although we note that this rule of thumb seems not to
extend to simple voting in this case!).

Figure 6. Receiver operating characteristic curves for the Space Warps
system, using the CFHTLS training set. Left: Stage 1, right: Stage 2.
Insets show a magnified view of the top lefthand corner of each plot.
Linestyles illustrate the software agent properties (initial confusion
matrix elements, learning or not) discussed in the text, as well as the
difference between the online and offline analyses. The Stage 2 sample
was defined using the online Stage 1 results (solid blue curve), while
the offline analysis was chosen for the final Stage 2 results
(dot-dashed orange curve). Lens probability threshold varies along the
curves: the symbols indicate the points where the value of this
quantity is 0.95. In the lefthand panel, the circle point is masked by
the square point.

Assuming Poisson statistics for the fluctuations in the numbers of
recovered lenses, the uncertainty in the measured Stage 2 TPR values is
around 8%, but the online and offline samples are highly correlated,
such that the uncertainty on the difference between the ROC curves is
somewhat less than this. Still, a larger validation set is needed to
test these algorithmic choices more rigorously. At Stage 1, high TPR
can be measured to better than 1%.

Adopting the online Stage 1 analysis, and the offline Stage 2 analysis,
we show in Figure 7 a plot of the more familiar (to astronomers)
quantities of completeness versus purity, again for the two stages. As
in Figure 6, the detection threshold varies along the curves.
Completeness is defined as the number of correctly detected sims
divided by the total number of sims in the training set, while purity
is the number of correct detections divided by the total number of
detections. If the training set were to be sampled fairly from the test
set, the completeness of the training set would be equal to the
completeness of the test set. (In practice this will likely only be
approximately true, as simulated lenses [and the distributions of their
properties] are used instead of real ones.)

The purity depends on the proportion of sims to duds, and so the purity
of the test set must be approximated by rescaling the training set to
the expected proportion of lens systems to not-lens systems in the
survey. We expect there to be around 90 lenses in the CFHTLS already (a
rate of 1 lens in every 5000 images or so); to a very good
approximation the number of non-lens images in the survey is just the
number of images in the survey (some 430,000). First we compute the
expected number of false positives by multiplying the FPR by the
expected number of non-lenses in the survey (430,000). Then we multiply
the TPR by the expected number of lenses in the survey (90), to get the
expected number of true positives. The sum of the true positives and
the false positives gives the expected sample size; dividing the
expected number of true positives by this sample size gives the purity.
Note that the completeness is invariant to this transformation. The
Stage 1 curves are truncated by the retirement of subjects in this
phase, which sets the minimum size of this sample. We see from the
solid blue curve that over 90% completeness was reached, albeit in
samples with not more than 30% purity. We set the detection threshold
for Stage 1 to be 0.95 (shown by the blue star in Figure 7), leading to
a sample with 94% completeness and 15% purity.

Footnote: The completeness is equivalent to the TPR and is also known
elsewhere as the “recall.” The purity is also known as the “precision.”

Figure 7. Completeness-estimated purity curves for the Space Warps
system, using the CFHTLS training set. The curves are truncated at the
high purity end by the “detection” limit subject probability (subjects
in the online Stage 1 analysis were deemed to be detected at p = 0.95,
while at Stage 2 rounding error led to an upper limit of p = 1.0). The
inset shows a magnified view of the bottom righthand corner of the
plot.
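
The purity rescaling described above amounts to a few lines of
arithmetic, sketched here in Python. The function name is hypothetical;
the default numbers of lenses (90) and images (430,000) are the
expected values quoted in the text, and the false positive rate used in
the usage note is purely illustrative rather than a measured value.

```python
def expected_purity(tpr, fpr, n_lenses=90, n_images=430_000):
    """Estimated purity of the test-set sample at a given point on the ROC curve."""
    true_positives = tpr * n_lenses    # expected number of real lenses recovered
    false_positives = fpr * n_images   # non-lenses ~ all images, to good approximation
    sample_size = true_positives + false_positives
    if sample_size == 0:
        return float("nan")            # empty sample: purity undefined
    return true_positives / sample_size
```

For instance, a TPR of 0.94 combined with an illustrative FPR of 0.001
gives about 85 true positives and 430 false positives, and hence a
purity of about 16%: because non-lenses outnumber lenses by several
thousand to one, even a very small false positive rate dominates the
sample.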

Does this performance level vary between the different types of
gravitational lens? To investigate the completeness to the three
different types of lens in the training set, we repeated the same
procedure but now considering only the detections of a certain kind of
lens and of the non-lenses in the training set. We estimate the
expected number of lenses and non-lens false positives by dividing the
lens and dud sets into equal fractions. The lensed quasar part of the
training set yielded the highest completeness, suggesting that these
were the easiest sims to spot. The lensed galaxies were recovered at
the lowest completeness, likely due to the difficulty of separating the
lens and source galaxy light.

At Stage 2, where no retirement was carried out, it 
was possible to reach 100% purity: the knee of the curve 
is at just under 90% completeness. However, the purity de¬ 
creases rapidly if higher completeness than this is sought. 
The optimal sample in this simulated lens search experi¬ 
ment would have been constructed with a threshold value of 
Pr(LENS|C,T) > 0.47. At 100% purity and 90% complete¬ 
ness, it would have contained around 89 lens candidates. 
However, we remind the reader that all these values depend
on the properties of the training subjects, which
were chosen to be fairly visible: we expect the completeness
to real lenses to be somewhat lower as a result (see
Paper II).

Finally, it is worth noting the implications of the re¬ 
sults in this section for future studies of samples of lenses 
(and lens candidates) discovered through visual inspection 
at Space Warps. The ROC curve analysis we have carried 
out should be very familiar to those working in machine
classification, and in fact is identical to that which would
be performed on the classifications made by a new automated
method. Supervised machine learning methods and
the Space Warps system as described here both require a
training set, and both return quantitative, probabilistic classifications
(consisting of a “label” and some measure of confidence
in that label) that can be used in further analysis. A
good example of where such quantification is important is
in the derivation of the selection function for a given sample.
While such a derivation is beyond the scope of this work, it 
is sufficient to note that from the point of view of an as¬ 
tronomer seeking to derive a selection function, the labels 
produced by Space Warps and those produced by a machine 
learning method can be treated equivalently: we have
succeeded in elevating visual inspection to a quantitative
science. Indeed, the discussion of completeness given above
is all in terms of recovery of a large set of simulated lenses,
just as it would be in the case of an automated method -
and in both cases, the limiting factor is the realism of the
training set. In Paper II we investigate this limit further
by assessing the performance of the Space Warps system
against the (small) set of real lens candidates known to lie
in the field; for now, we note that the training set used in


this work constitutes a valid sample of objects for any alter¬ 
native machine learning lens detection method to be tested 
against. In future, we could employ a larger training set, to 
enable the selection efficiency to be characterised as a func¬ 
tion of multiple observables. This will be achievable, since 
the small training set we used in the current study was clas¬ 
sified around 20 times more than was the test set: we could 
therefore collect many fewer classifications per training im¬ 
age in a much larger training set. 


5.2 Crowd Properties 

To investigate the properties of the Space Warps crowd, as
characterised by their agents, we define the following quantities
and plot their 1-D and 2-D marginal distributions in
Figure 8. This figure only shows the Stage 1 agents for clarity,
but we found the same trends in Stage 2.

“Effort:” The number of test images, $N_C$, classified by a
volunteer. In Stage 1, the mean effort per agent was 263; in
the shorter Stage 2 it was 81.

“Experience:” The number of training images, $N_T$, classified
by a volunteer. In Stage 1, the mean experience per
agent was 29; in Stage 2 (where the training image frequency
was set higher) it was 34.

“Skill:” The expectation value of the information gained
per classification (in bits) by that volunteer, $\langle \Delta I \rangle_{0.5}$, for
subjects which have lens probability 0.5 (see the Appendix). Random
classifiers have $\langle \Delta I \rangle_{0.5} = 0.0$ bits, while perfect classifiers
have $\langle \Delta I \rangle_{0.5} = 1.0$ bit. All software agents start with
$\langle \Delta I \rangle_{0.5} = 0.0$ bits. The skill of an agent increases as training
subjects are classified, and the agent’s estimates of its confusion
matrix elements improve. In Stage 1, the mean skill
per agent was 0.04 bits; in Stage 2 it was 0.05 bits.

“Contribution:” This is the integrated skill over a volunteer’s
test subject classification history, a quantity representative
of the total contribution to the project made by
that volunteer (see the Appendix for more discussion of this
quantity). The classifications of training images allow us to
estimate the skill, while the classification of test images determines
contribution. In Stage 1, the mean contribution per
agent was 34.9 bits; in Stage 2 it was 33.5 bits.

“Information:” The total information $\Delta I$ generated by
the software agent during the volunteer’s classification activity.
This quantity depends on the value of each subject’s
lens probability when that subject was presented to the volunteer
(simply because information gain is defined in terms
of posterior relative to prior probabilities; see the Appendix). As
a result, we expect this raw quantity to be noisier than the
“contribution” defined above. The lower righthand panel of
Figure 8 confirms this: there is a strong, linear correlation
between contribution and information gain, but at a given
contribution, the distribution of information gain has a tail
at high values, corresponding to volunteers that were lucky
in the number of high probability subjects that they happened
to be shown (the information gain per classification is
maximized when the subject probability is 0.5, while most
subjects have probabilities closer to or less than the prior).
In most of the following discussion we use the contribution
when characterizing crowd participation, but we include the
information gain for completeness.
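
To make the “skill” definition concrete, the sketch below computes the expected information gain per classification for a subject with lens probability 0.5. It assumes that the expected gain of posterior relative to prior is the mutual information between the true label and the volunteer's label, and that an agent is described by two confusion matrix elements, here called m_ll = Pr(says LENS | LENS) and m_nn = Pr(says NOT | NOT). The function name and this exact form are illustrative assumptions, not taken from the SWAP code; the sketch does reproduce the limits quoted above (0 bits for a random classifier, 1 bit for a perfect one).

```python
import math


def skill_bits(m_ll, m_nn, p_lens=0.5):
    """Expected information gain (bits) from a single classification.

    m_ll   : Pr(volunteer says LENS | subject is a lens)
    m_nn   : Pr(volunteer says NOT  | subject is not a lens)
    p_lens : prior lens probability of the subject (0.5 for "skill")
    """
    p_says_lens = p_lens * m_ll + (1.0 - p_lens) * (1.0 - m_nn)
    p_says_not = 1.0 - p_says_lens
    # (joint probability, prior of the truth, marginal of the label)
    terms = [
        (p_lens * m_ll, p_lens, p_says_lens),
        (p_lens * (1.0 - m_ll), p_lens, p_says_not),
        ((1.0 - p_lens) * (1.0 - m_nn), 1.0 - p_lens, p_says_lens),
        ((1.0 - p_lens) * m_nn, 1.0 - p_lens, p_says_not),
    ]
    # Terms with zero joint probability contribute nothing.
    return sum(j * math.log2(j / (pt * pl)) for j, pt, pl in terms if j > 0)


for m in (0.5, 0.6, 0.7, 0.9, 1.0):
    print(f"m_ll = m_nn = {m:.1f}: skill = {skill_bits(m, m):.3f} bits")
```

Under this assumption, a volunteer who is right 60% of the time on both lenses and non-lenses has a skill of only about 0.03 bits, which gives a feel for mean skills of 0.04 to 0.05 bits per agent.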
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Figure 8. Key properties and contributions of the Space Warps Stage 1 crowd. Plotted are the 1-D and 2-D smoothed marginalized 
distributions for the logarithms of the properties of the agents described in the text. The contours contain 68%, 95% and 99% of the 
distributions. 


The leftmost column of Figure 8 shows how the last four
of these properties depend on the effort expended by the
volunteers. As expected, we see that experience is tightly
correlated with effort, reflecting the design of training images
being presented throughout each stage (albeit at decreasing
frequency). We also see a strong correlation between
effort and skill, which was hoped for but not guaranteed: the
more images the volunteers see, the better able they are to
contribute information.

The effort distribution shows two peaks, suggestive of 
two types of participation. The sharp spike at just a few 
images classified presumably corresponds to visitors who 
only classify a few subjects before leaving the site again. 
The broader hump contains people doing tens to hundreds 
of classifications: the skill vs. experience panel of Figure 8
shows this group to achieve a broad distribution of skill,
peaking at around 0.05 but with a long tail to higher values. 

At high values of experience and effort, the skill is al¬ 
ways high. There seem to be very few agents logging large 
numbers of classifications at low skill: almost all high ef¬ 
fort “super-users” have high skill. These two properties are 
reflected in both the contributions these volunteers make 
(third row) and the information they generate (fourth row), 
and we suggest that this would be a useful metric for deter¬ 
mining “well-designed” implementations of citizen science 
projects. 

We found the distributions for the Stage 2 agents
to be qualitatively very similar to those for the Stage 1
agents. The differences are: 1) the maximum effort possible
at Stage 2 is smaller, simply because fewer subjects were
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Figure 9. Cumulative distributions of the contributions made 
by the agents: 90% of the contribution was made by the highest 
contributing 1% of the crowd at Stage 1 (blue), and the highest 
contributing 7% of the crowd at Stage 2 (orange). The Stage 1 
agents are shown in blue, the Stage 2 agents in orange. “Experi¬ 
enced volunteers” classified 10 or more training subjects. 



Figure 10. Broad cumulative distributions of agent skill: the 
most skilled 20% of the crowd only possess around 80% of the 
skill, at Stage 1. The Stage 1 agents are shown in blue, the Stage 2 
agents in orange. “Experienced volunteers” classified 10 or more 
training subjects. 


available to be classified, and 2) the information generated
per agent was slightly higher at Stage 2, just because the
subjects had (on average) higher probability.

Figure 8 shows the Space Warps crowd to have quite
broad distributions of logarithmic effort, skill, and contribution.
To better quantify the contributions made by the volunteers,
we show their cumulative distribution on a linear
scale in Figure 9. This plot shows clearly the importance of
the most active volunteers: at Stage 1, 1.0% of the volunteers 
- 375 people - made 90% of the contribution. At Stage 2, 
where it was not possible to make as many classifications 
before running out of subjects, 7.2% of the volunteers - 141 
people - made 90% of the contribution. 

However, it is not the case that only these small groups
were capable of making large contributions. The cumulative
distribution of agent skill is shown in Figure 10; these distributions
are significantly broader than the corresponding
distributions of agent contribution in Figure 9. For example,
80% of the skill is distributed among 20% of the agents. The
inexperienced volunteers also possess a significant fraction of
the skill. We find that dividing the crowd into “experienced
volunteers” (who have seen 10 or more training images) and
“inexperienced volunteers” (the rest) results in two groups
containing approximately equal total skill (see the dashed
curve in Figure 10): the most skillful 20% of experienced
volunteers (1824 people) only possess 43% of the total skill.
The breadth of the skill distributions suggests that the high 
level of contribution made at Space Warps by experienced 
volunteers is largely a matter of choice (or perhaps, availabil¬ 
ity of time!). There were many other volunteers that were 
skillful enough to make large contributions - they just didn’t 
classify as many images. 
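
Statements such as “1.0% of the volunteers made 90% of the contribution” can be read off a cumulative distribution of ranked contributions. The sketch below shows one way to compute such a number; the function name and the toy contribution values are hypothetical and are not taken from the Space Warps data.

```python
def fraction_of_crowd_for_share(contributions, share=0.9):
    """Smallest fraction of the crowd, taken from the top contributors
    downwards, whose summed contribution reaches `share` of the total."""
    ranked = sorted(contributions, reverse=True)
    total = sum(ranked)
    running = 0.0
    for n, c in enumerate(ranked, start=1):
        running += c
        if running >= share * total:
            return n / len(ranked)
    return 1.0


# Toy crowd: one heavy contributor and many occasional visitors.
toy = [95.0] + [0.05] * 99
print(fraction_of_crowd_for_share(toy, share=0.9))
```

The same function applied to per-agent skill instead of contribution would give the broader distribution discussed above.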


5.3 Classification Speed 

How fast does the Space Warps crowd classify subjects? 
Each software agent records the timestamp of each classification
its volunteer makes; by measuring the time lags between
successive classifications, we can make estimates of
the crowd’s classification and contribution speed. We plot
the former quantity for both Stage 1 and Stage 2 in Figure 11,
normalizing the speed and time axes to their respective
totals. The fractional classification rates in each stage
of the survey fall off in approximately the same way, despite
the factor of nearly 20 difference in crowd size. Stage 1 (consisting
of ~430,000 subjects) was completed in around 5000
hours (by some 37,000 participants), with classification rates
in the first few days averaging ~$10^4$ per hour. This was
stimulated by various forms of advertising (press releases,
and emails to registered Zooniverse users). The asymptotic
classification rate was around ten times lower, at ~$10^3$ per
hour. Interestingly, the average skill was found to be approximately
constant over the lifetime of Stage 1, leading
to a contribution rate that tracks closely the classification
rate. (A 10% increase in average skill was seen over the first
40 days, and a decrease of the same amount in the last 40
days - not enough to cause a significant difference between
classification and contribution rate behaviour, but perhaps
reflecting a learning period at the start and then a decrease
in participation later on.) Stage 2 (nearly 3400 subjects) was
completed in around 600 hours by around 2000 participants,
with classification rates starting at ~2000 per hour and
decaying to ~100 per hour.

The similarities between the Stage 1 and Stage 2 results
above, despite the difference in the task set, suggest
that these numbers can be cautiously scaled to estimate
the speed of future Space Warps projects. For example, the
completion time $T$ for a Space Warps project may be approximated
as

    $$T \approx \tau_0 \left(\frac{N_s}{10^5}\right) \left(\frac{10^5}{N_p}\right), \qquad (12)$$

where $N_s$ is the number of subjects to be classified, and $N_p$
is the number of volunteers in the crowd. Just using the
numbers above, we find a characteristic timescale of $\tau_0 =$
Figure 11. Fractional classification rate in the Space Warps
— CFHTLS survey. Fractional classification rate is the number
of classifications per unit time, divided by the final total number
of classifications. Fractional survey time is the time elapsed since
the beginning of the survey, divided by the total length of the
survey. The unit of time in the fractional classification rate is one
survey, so that the integrals under the curves are both
one.

18 days, but as we will see in the next section, we expect 
this to be significantly shorter in future projects. 
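
The 18-day figure can be checked with a short calculation. The sketch below assumes the completion time scales as T = tau0 * (N_s / N_p), i.e. in proportion to the number of subjects per volunteer with equal reference values for N_s and N_p; the function names are hypothetical, and the inputs are the rounded Stage 1 and Stage 2 numbers quoted in this section.

```python
def characteristic_timescale_days(t_days, n_subjects, n_volunteers):
    """Invert T = tau0 * (N_s / N_p) for tau0, in days."""
    return t_days * n_volunteers / n_subjects


def completion_time_days(n_subjects, n_volunteers, tau0_days):
    """Predicted completion time for a project, in days."""
    return tau0_days * n_subjects / n_volunteers


# Stage 1: ~430,000 subjects, ~37,000 volunteers, ~5000 hours.
tau0_stage1 = characteristic_timescale_days(5000 / 24, 430_000, 37_000)
# Stage 2: ~3400 subjects, ~2000 volunteers, ~600 hours.
tau0_stage2 = characteristic_timescale_days(600 / 24, 3400, 2000)
print(f"tau0 (Stage 1) ~ {tau0_stage1:.1f} days")
print(f"tau0 (Stage 2) ~ {tau0_stage2:.1f} days")
```

Under this assumed scaling the Stage 1 numbers give a timescale of about 18 days and the Stage 2 numbers about 15 days, which illustrates how similar the two stages are despite their very different sizes.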


6 DISCUSSION 

What can we learn from the results of the previous sec¬ 
tion, for future projects? Potential improvements to the 
Space Warps system can be divided into three categories: 
performance, efficiency and capacity. 

6.1 Improving performance: reducing 
incompleteness and impurity 

We investigated the source of the incompleteness and im¬ 
purity visible in Figure by examining the Stage 2 “false 
negatives” (simulated lenses that incorrectly acquired P < 
10“^) and false positives (duds or impostors that incor¬ 
rectly acquired P > 0.95), and their behaviour as they 
are classified using the online analysis trajectory plots introduced
earlier. Figure 12 shows 2 example
simulated lenses that were missed (Pr(LENS|C, T) <
10“^) by the Space Warps system (top row), and 2 exam¬ 
ple non-lenses that were incorrectly flagged as candidates 
(Pr(LENS|C, T) > 0.95) by the Space Warps system (bot¬ 
tom row). 

In some cases the rejection of the false negatives is un¬ 
derstandable: the lensed features are faint, or in some cases, 
appear somewhat unrealistic compared to real lenses. How¬ 
ever, in other cases a reasonably obvious lens was passed 
over. This mainly seems to be due to noise in the system: 
when only low-skill classifiers view a subject, all the updates 
to its posterior probability are small, and if none are very 
confident about the presence of lensing, the subject follows 
a random walk down its trajectory plot. This can be seen 


in the top left panel of Figure 12. The false positives show
similar behaviour, for example in the bottom lefthand panel 
of the figure. 
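
The random-walk behaviour described above follows from the size of each Bayesian update. As a minimal sketch, assume each classification updates the subject's lens probability via Bayes' theorem using two confusion matrix elements for the classifier, m_ll = Pr(says LENS | LENS) and m_nn = Pr(says NOT | NOT). The function name and the example skill values are illustrative, not taken from the SWAP code; the starting probability of 2e-4 corresponds to roughly one lens in every 5000 images.

```python
import math


def update_lens_probability(p, says_lens, m_ll, m_nn):
    """Posterior lens probability after one classification (Bayes' theorem).

    Assumes 0 < p < 1 and confusion matrix elements strictly between 0 and 1,
    so that the denominator cannot vanish.
    """
    if says_lens:
        like_lens, like_not = m_ll, 1.0 - m_nn
    else:
        like_lens, like_not = 1.0 - m_ll, m_nn
    return p * like_lens / (p * like_lens + (1.0 - p) * like_not)


prior = 2e-4
for label, m in (("low skill", 0.55), ("high skill", 0.90)):
    p_new = update_lens_probability(prior, True, m, m)
    step = math.log10(p_new / prior)
    print(f"{label}: one 'LENS' click moves log10(P) by {step:+.2f}")
```

A low-skill classifier moves the subject by less than a tenth of a decade per click, so a sequence of such clicks in mixed directions behaves like a short-step random walk, whereas a single high-skill click moves it by nearly a full decade.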

As well as subjects being “unlucky” in this way, there 
are two less common failure modes associated with mistakes 
made by higher skill volunteers, illustrated in the righthand 
column of the figure. In the false negative subject shown 
in the top right-hand panel, several high skill classifiers up¬ 
date the subject upwards in log probability by some way 
each time, but other, comparable skill classifiers mis-classify 
the system to lower probability. The trajectory looks like 
a random walk, but with bigger step sizes; this particular 
subject came very close to crossing the detection threshold 
three times, but didn’t quite make it. The bottom right-hand 
panel shows an example of a final, apparently rare, failure 
mode: we see some short-step random walk behaviour, followed
by a mis-classification by a very high skill “expert”
classifier after 20 classifications that “kicks” the subject to 
high probability. 

There are a number of places where we can address 
these problems and improve system performance: adding 
flexibility to the classification interface, educating the volun¬ 
teers, assigning subjects for classification, and interpreting 
the classification data. 

Some of the mistakes made by reasonably high skill clas¬ 
sifiers working at high speed could have perhaps been cor¬ 
rected by those classifiers themselves, had they had access to 
a “go back” option. While clearly enabling error reduction, 
we might worry about such a mechanism having a negative 
effect on volunteer confidence: it may be that encouraging 
this sort of checking would result in increased and not neces¬ 
sarily productive caution. This could be tested by presenting 
a fraction of the volunteers with a version of the site that 
actively suggested that they take this approach, and then 
tracking the relative performance of “collaborative” clas¬ 
sifications compared with independent ones. We leave the 
exploration of this to further work. 

Mistakes by both high and low skill classifiers could 
be reduced by improving the training in the system, which 
could be done in several ways. One is to make more training
images available to those who want or need them. A basic level
of training is needed for SWAP to build up an accurate
picture of each classifier’s skill - but one could imagine
volunteers choosing to see more training images (still at ran¬ 
dom) in their stream. We could also experiment with pro¬ 
viding greater training rates early on for all volunteers, al¬ 
though this carries significant risk: retention rates may drop 
if too few “fresh” test images are shown early on. Another 
way of improving the training could be to provide more in¬ 
formation about what gravitational lens systems look like. 
In this project, the Spotter’s Guide, and the Science, FAQ 
and About pages were always available on the site, but as a 
passive background resource. We might consider providing 
more links to this guide in the feedback messages shown to 
the classifiers as they go - or perhaps extending these mes¬ 
sages to themselves include more explanations and example 
images. We might also investigate a more dynamic Spot¬ 
ter’s Guide: a set of manipulable model lenses illustrating all 
the possible image configurations that those deflectors can 
make could help volunteers very quickly gain understanding 
[Figure 12 panel titles: ASWOOOSwyp: Stage 2 False Negative; ASW0006rkx: Stage 1 False Negative]

Figure 12. Illustrative examples of false negatives (blue) and false positives (red) from the classification of the Space Warps —CFHTLS 
training set. The trajectory plot for each of the 4 subjects is shown to the right of its image. Lefthand panels show the most common 
types of trajectories: random walks with short steps resulting from no high skill classifiers viewing the subject. The righthand panels 
show examples where higher skill classifiers have been involved, causing larger, more efficient “kicks” (top) but also catastrophic mis- 
classifications (bottom right). Note that both these subjects were very nearly detected but didn’t quite reach the threshold set (marked 
by the blue dashed line). Insets for the false negatives show the lens feature. The inset for the lower righthand panel shows where most 
volunteers believed a lens was located. There is no corresponding inset for the lower left panel because no particular feature stood out 
to the citizens. Instead, the volunteers clicked many different features. 


of what lensed features can look like. Such a tool is under
development.

However, many false negatives (around 60%, from in¬ 
spection of a random sample) seem to be due to the sta¬ 
tistical noise associated with the many short, semi-random 
direction kicks arising from classifications made by low skill,
inexperienced classifiers (Section 5.2). While improved training
will help reduce this effect, it is here that targeted task
assignment could also help improve system performance, by 
bringing higher skill volunteers in when needed. We adver¬ 
tised Stage 2 of the current project to all registered users; 
it was taken up by Stage 1 veterans with a broad range of 
skill, and also picked up a significant number of new users 
who did not have enough time to acquire high skill (since 
Stage 2 was quite short): of the 1964 volunteers who took 
part in the Stage 2 classification round, only 774 were vet¬ 
erans from Stage 1. This issue is discussed further in Pa¬ 
per II, where we investigate the known lenses missed by the 
Space Warps system. 

Figure 13 shows how the Stage 1 skill maps on to Stage 2
skill and contribution. This figure suggests that, while the 
gains are likely small (since most of the contribution is still 
made by high skill volunteers), there could have been some 
benefit to opening Stage 2 on the basis of Stage 1 agent 
skill, in order to reduce the noise in the system generated 
by new and low skill volunteers in this more difficult clas¬ 
sification stage. In future, dynamically allocating subjects 


® http://slowe.github.io/LensToy 
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Figure 13. Stage 2 classifier skill, as a function of their Stage 1 
skill. Veterans from Stage 1 are shown in blue, while new volun¬ 
teers are shown in orange. Point size is proportional to the total 
contribution made, while dashed lines are drawn at the mean 
values for each sample. 


to volunteers according to their agents’ skill could allevi¬ 
ate this issue. In particular one can imagine doing this for 
subjects that have classification histories that indicate that 
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they are worthy of further study. These are things we plan 
to experiment with in future. 

Finally, there could be some performance gains to be
made by improving the SWAP agent model, or its implementation.
Low skill is typically a result of inexperience
(Section 5.2) - but it could be the agent that is inexperienced,
as much as the classifier. In a future paper we plan to
investigate the use of the test images as well as the training
images in accelerating the agent’s learning (Davis et al., in
preparation). We also plan to investigate the use of offline
analysis at Stage 1, partly for the same reason (see also Section
6.2 below). Given the above findings about the effects
of noise in the SWAP system, we are also motivated to explore
further the possible noise-reducing effects of doing
the analysis offline (Section 5.1).

One might also consider introducing more
conditional dependences in the confusion matrix elements,
to allow for some classifiers having greater skill in spotting
one type of lens than another, or, more generally, as a function
of lens property (such as colour, brightness, and image
separation). In the current model, all agents are considered
to be completely independent, whereas in fact we might expect
there to be significant clustering of the confusion matrix
elements in the $M_{NN}$ - $M_{LL}$ plane. A hierarchical model
for the crowd, with hyper-parameters describing the distribution
of confusion matrix elements across the population,
may well accelerate the agent learning process by including
the notion of one agent being likely to be similar to its
neighbours in the parameter space. Finally, it is worth noting
that the model of Simpson et al. (2013) explicitly avoids
the assumption of the agent confusion matrix elements being
constant in time (as we have assumed), allowing
the development of volunteer skill to be more accurately
tracked - and that they did see some time-evolution in the
supernova zoo classifiers’ skill. Finding a way to incorporate
such a learning model into SWAP while retaining its online
character is an interesting challenge for future work.


6.2 Improving efficiency 

Table 2 shows the total effort, contribution, skill and information
generated in both Stage 1 and Stage 2 of the
CFHTLS project, with the total numbers of agents and subjects
for comparison. These numbers allow us to quantify the
efficiency of the system.

The contribution per classification is defined in terms 
of a hypothetical subject with lens probability of 0.5; one 
bit of information is needed to update such a subject’s lens 
probability to either zero or one. This means that a maxi¬ 
mally complete classification stage would yield a total con¬ 
tribution (summed over all agents) equal to the number of 
subjects. The ratio of this hypothetical optimum to the ac¬ 
tual total contribution is therefore a measure of the stage’s 
inefficiency. We find our inefficiency (by dividing column 2
by column 3) to be 33% and 17% in Stage 1 and Stage 2
respectively. In Stage 1, this inefficiency is due to the daily
processing: we were not able to retire subjects fast enough,
and so they remained in the system, being over-classified.
Indeed, only 3705745 classifications were needed to retire all
the subjects: the ratio of this to the total number made is
34%. (The remaining 1% is due to not all subjects being
classified to 1 or 0 probability.) At Stage 2, we did not retire
any subjects at all; the inefficiency in this case was by
design, to give everyone a chance to appreciate what they
had found together. (An unwanted side effect of this policy
was noted in Figure 13 in the previous sub-section.)
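
The percentages quoted above follow directly from the Table 2 totals. The short check below reproduces them; the dictionary layout and variable names are illustrative only.

```python
# Totals from Table 2 (subjects, total contribution in bits, classifications).
table2 = {
    "Stage 1": {"subjects": 427064, "contribution_bits": 1292016.3,
                "classifications": 10802125},
    "Stage 2": {"subjects": 3679, "contribution_bits": 21895.8,
                "classifications": 224745},
}


def optimum_to_actual(row):
    """Ratio of the hypothetical optimum (one bit per subject) to the
    actual total contribution, i.e. column 2 divided by column 3."""
    return row["subjects"] / row["contribution_bits"]


for stage, row in table2.items():
    print(f"{stage}: {100 * optimum_to_actual(row):.0f}%")

# Classifications needed to retire all Stage 1 subjects, as quoted in the text.
needed_stage1 = 3705745
retire_ratio = needed_stage1 / table2["Stage 1"]["classifications"]
print(f"Stage 1 needed/made: {100 * retire_ratio:.0f}%")
```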

It is clear that to increase the efficiency of the system
we need to reduce the time lag between the classification
being made and its outcome being analyzed. The optimal
way to do this would be to have the web app itself analyze
the classifications in real time. This is under investigation
for future projects. There may still be a place for a daily or
weekly offline analysis: this could potentially reduce the false
negative and false positive rates by “resurrecting” subjects
that had been retired by the online system before the software
agents had time to learn enough about their classifier’s
high skill.

If, as expected, the system efficiency could indeed be improved
by a factor of three via real-time classification analysis,
we would expect the characteristic completion time $\tau_0$
in Equation 12 to decrease by roughly the same factor, suggesting
that a dataset of $10^5$ images could be searched for
lenses by a crowd of $10^5$ volunteers in about 6 days.

The futuristic lens-finding problem sketched in Sec¬ 
tion where 10® lenses are to be found in a 10"* square 
degree wide field imaging survey, is approximately repre¬ 
sentative of the challenge facing both the LSST and Euclid 


strong lensing science teams (LSST Science Collaboration
et al. 2009; Refregier et al. 2010). What role could citizen
scientists at Space Warps play in lens searching in the next
decade? From Equation 12 and assuming $\tau_0 = 6$ days (as
above), we see that for the Space Warps crowd to be able 
to classify 10® images of photometrically-selected massive 
galaxies and groups in approximately 60 days, it would need 
to contain 10^ volunteers. This is only a factor of ten larger 
than the current size of the Zooniverse userbase. Alterna¬ 
tively, suppose that automated lens finding algorithms were 
able to select as few as 10® targets for visual inspection, 
corresponding to a sample with 10% purity and a surface 
density of ~ 10 per square degree (a rate within reach of 
RingFinder, for example: Gavazzi et al. 2014). Now increasing
$\tau_0$ by a factor of three to allow for more inspection
time per subject, we find that in this case Space Warps 
could enable a crowd of just 10® volunteers to assess them 
in about three weeks. 


6.3 Increasing crowd capacity 

Finally, we comment on the size of the first Space Warps 
crowd, and how the system might be scaled up for future 
surveys. In Section 5.2 and Table 2, the following rough picture
emerged for the Space Warps crowd in this project: it
consisted of a few $10^4$ volunteers, with a few $10^3$ achieving
considerable skill, and a few $10^2$ having the time to make
a significant contribution. Slightly more quantitatively, we 
might note that the total skill of the crowd, computed by 
summing the skill of all the agents, is a measure of the effec¬ 
tive crowd size, in the sense that a crowd of perfect classifiers 
would be of this size. By this measure, the Stage 1 crowd 
was equivalent to a team of 1470 perfect classifiers, while 
the Stage 2 crowd was equivalent to a team of 102 perfect 
classifiers. With this same crowd, we saw in Section 5.3 that
surveys providing a few 10^ subjects would be completed 
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Table 2. Total crowd and subject sample properties from the CFHTLS project. 


Stage  Subjects  Contribution  Agents  Skill    Classifications  Candidates  Information
                 (bits)                (bits)                                (bits)
1      427064    1292016.3     36982   1471.9   10802125         3381        91122.6
2      3679      21895.8       1964    102.4    224745           89          1640.4


quickly, if the high contribution rate of the current crowd 
were to be repeated. 

There are (at least) two ways in which we might in¬ 
crease the numbers of high-contribution volunteers for larger 
projects in future. The first is simply to increase the total 
crowd size, and hope that a similar fraction of volunteers 
make large contributions. Greater exposure of the website 
to the public through mass media would help. Another op¬ 
tion is to advertise the project to new groups of volunteers 
by translating the website into other languages (something 
which is now supported by the Zooniverse). A multi-lingual 
userbase would come with its own set of challenges, espe¬ 
cially in terms of volunteers’ continuing training and inter¬ 
actions on Talk. 

The second way to scale up the number of high contri¬ 
bution classifiers is to increase the rate at which new volun¬ 
teers become dedicated volunteers. Based on feedback from 
the wider Space Warps community, this could potentially 
be achieved through closer collaboration with the science 
team. It is also possible that dynamically assigning subjects 
on the basis of the volunteer’s skill could act as an incen¬ 
tive to some volunteers to increase their contribution, al¬ 
though any such approach would need to take into account 
the need to optimise not only for efficiency but also for the 
most interesting or pleasurable volunteer experience; a solu¬ 
tion which gave every volunteer a very uniform experience, 
for example, is unlikely to succeed even if it appears opti¬ 
mally efficient. Reducing the rate at which new volunteers 
lose interest could also play a role. Anecdotally, it seems 
fairly common for new volunteers to be wary of classifying 
at all, for fear of introducing errors. Better explanation of 
how their early classifications are analyzed could help assuage
these fears: Section 5.2 shows that effectively down-weighting
new volunteers’ classifications (by setting their
agents’ initial confusion matrix elements to those of a ran¬ 
dom classifier) leads to best performance, a result which 
should be of some comfort to the nervous volunteer. 


7 CONCLUSIONS 

We have designed, implemented and tested a system for
detecting new strong gravitational lens candidates in wide field
imaging surveys by crowd-sourced visual inspection. The
Space Warps web-based classification interface presents
carefully prepared colour composite sky images to
volunteers, who mark features they propose to have been lensed.
The participants receive ongoing training with a mixture
of simulated lens and known-to-be-empty images, and we
use this information to automate the interpretation of their
markers. In our first lens search we simply divided the
CFHTLS imaging into some 430,000 tiles, and collected over
11 million volunteer image classifications.

By analyzing the classifications made of the training 
set, we conclude that gravitational lens detection by crowd- 
sourced visual inspection works, and in the following specific 
ways: 

• Participation levels were high (about 37,000 volunteers, 
contributing classifications at rates between 10^ and 10"* images
per hour), suggesting that if this can be maintained,
visual inspection of tens of thousands of images could be 
performed in just a few weeks. An expanded crowd of 10® 
volunteers (which has already been reached in Zooniverse) 
would be able to inspect a plausible sample of 10® LSST or 
Euclid targets on similar time scales. 

• Over the course of the two stages of the CFHTLS
project, the set of images was reduced by a factor of $10^{3}$
(from a few hundred thousand to a few hundred), enough
for an expert team to take over.

• The “skill” of the volunteers correlates strongly with 
the number of images they have classified. Quantifying the 
“contribution” as the integrated skill over the test subjects 
classified, we find that about 90% of the contribution was 
made by a few hundred volunteers (1% of the crowd) - but 
that the broad width of the skill distribution means that a 
few thousand volunteers could have made comparable
contributions had they classified more images.

• The optimal true positive rate (completeness) and false
positive rate in the training set were estimated to be around
92-94% and < 1% (in both classification stages). This FPR
translates to a purity of approximately 15-30%. We find
that even higher purities were achievable at Stage 2 (perhaps
as a result of the least visible candidates already being
discarded): here, 100% purity was reached at just under 90%
completeness. Because Space Warps is a supervised learning
system, these numbers should be taken to be upper limits
to what we should expect for real lenses (which have not
been selected for a high visibility training set).

• The simulated gravitational lenses that were missed
were predominantly galaxy-scale lenses with faint blue
galaxy sources, whose lensed features are difficult to
distinguish from the light from the lens galaxy (consistent with
what we find also for real lenses, see Paper II). We observed
some additional scatter, with some simulated lenses
accumulating low probability simply due to classification noise.
Future searches will need to address both of these issues. We
expect it to be especially interesting to try crowd-sourcing the
visual inspection of target systems which have had their
candidate lens galaxy light automatically subtracted off (as, for
example, RingFinder does), in a collaboration between
humans and machines.
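
The step from a false positive rate to a purity quoted in the list above follows from Bayes' theorem. The sketch below is only an illustration of that arithmetic: the function name `purity` is ours, the true positive rate of 0.93 is a representative value from the quoted range, and the lens prior of 2 x 10^-4 per subject is borrowed from Appendix A as an assumption (the prior appropriate to each stage may differ).

```python
def purity(tpr: float, fpr: float, prior: float) -> float:
    """Fraction of flagged subjects that truly contain a lens (Bayes' theorem)."""
    true_pos = tpr * prior            # lenses that are correctly flagged
    false_pos = fpr * (1.0 - prior)   # non-lenses that are wrongly flagged
    return true_pos / (true_pos + false_pos)


# Illustrative inputs only (assumptions, not measured values).
prior = 2e-4
tpr = 0.93
for fpr in (1e-2, 1e-3, 5e-4):
    print(f"FPR = {fpr:.0e} -> purity = {purity(tpr, fpr, prior):.3f}")
```

Under these assumptions, purities of a few tens of per cent require a false positive rate of order 10^-3 or smaller, because non-lenses outnumber lenses so heavily.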

In Paper II we present the results of this CFHTLS
lens search in more detail, in particular focusing on the
detection of real lenses, and the comparison between
Space Warps and various automated lens finding methods.
In this paper we have shown how Space Warps provides a
lens candidate detection service, crowd-sourcing the
time-consuming work of visually inspecting astronomical images
for gravitationally-lensed features. We invite survey teams
searching for lenses in their wide-field imaging data to
contact us if they would like our help.
APPENDIX A: INFORMATION GAIN PER
CLASSIFICATION, AGENT “SKILL” AND 
“CONTRIBUTION” 

With an agent’s confusion matrix in hand we can compute 
the information generated in any given classification. This 
will depend on the confusion matrix elements (Equation]^ 
but also on the probability of the subject being classified 
containing a lens. The quantity of interest is the relative
entropy, or Kullback-Leibler divergence, between the prior
and posterior probabilities for the possible truths T given
the submitted classification C:

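\begin{align}
\Delta I &= \sum_{T} \Pr(T|C)\,\log_2 \frac{\Pr(T|C)}{\Pr(T)} \notag \\
&= \Pr(\mathrm{LENS}|C)\,\log_2 \frac{\Pr(\mathrm{LENS}|C)}{\Pr(\mathrm{LENS})} \notag \\
&\quad + \Pr(\mathrm{NOT}|C)\,\log_2 \frac{\Pr(\mathrm{NOT}|C)}{\Pr(\mathrm{NOT})}, \tag{A1}
\end{align}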


where, as above, C can take the values “LENS” or “NOT”. 
Substituting for the posterior probabilities using Equationj^ 
we get an expression that just depends on the elements of 
the confusion matrix M and the pre-classification subject 
lens probability Pr(LENS) = p: 


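\begin{align}
\Delta I &= p\,\frac{M_{CL}}{p_C}\,\log_2 \frac{M_{CL}}{p_C} \notag \\
&\quad + (1 - p)\,\frac{M_{CN}}{p_C}\,\log_2 \frac{M_{CN}}{p_C}, \tag{A2}
\end{align}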


where the common denominator $p_C = p M_{CL} + (1 - p) M_{CN}$.
This expression has many interesting features. If $p$ is either
zero or one, $\Delta I(C) = 0$ regardless of the value of $C$ or the
values of the confusion matrix elements: if we know the
subject’s status with certainty, additional classifications supply
no new information. If we set $p$ to be the prior probability,
Equation A2 tells us how much information is generated by
classifying it all the way to $p = 1$ (which a perfect classifier,
with $M_{LL} = M_{NN} = 1$, can do in a single classification).
For a prior probability of $2 \times 10^{-4}$, 12.3 bits are generated
in such a “detection.” Conversely, only 0.0003 bits are
generated during the rejection of a subject with the same prior:
we are already fairly sure that each subject does not contain
a lens! Imperfect classifiers (with $M_{LL}$ and $M_{NN}$ both less
than 1) generate less than these maximum amounts of
information per classification; the only classifiers that generate
zero information are those that have $M_{LL} = 1 - M_{NN}$ (or
equivalently, $M_{CL} = M_{CN}$ for all values of $C$). We might
label such classifiers as “random”, since they are as likely to
classify a subject as a “LENS” no matter the true content
of that subject.
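
The limiting cases described above can be checked numerically. The following is a minimal sketch of Equation A2, not the Space Warps pipeline code; the function name `info_gain` and its argument names are our own, with `m_cl` and `m_cn` standing for the confusion matrix elements Pr(C|LENS) and Pr(C|NOT) of the classification actually made.

```python
import math


def info_gain(p: float, m_cl: float, m_cn: float) -> float:
    """Information (in bits) generated by one classification C, Equation A2.

    p    : pre-classification lens probability Pr(LENS)
    m_cl : Pr(C | LENS) for the classification that was made
    m_cn : Pr(C | NOT) for the classification that was made
    """
    p_c = p * m_cl + (1.0 - p) * m_cn  # Pr(C), the common denominator
    if p_c == 0.0:
        raise ValueError("this classification has zero probability")

    def term(weight: float, m: float) -> float:
        # weight * (m / p_c) * log2(m / p_c), taking 0 * log2(0) as 0
        ratio = m / p_c
        if weight == 0.0 or ratio == 0.0:
            return 0.0
        return weight * ratio * math.log2(ratio)

    return term(p, m_cl) + term(1.0 - p, m_cn)


prior = 2e-4
# Perfect classifier says "LENS": Pr("LENS"|LENS) = 1, Pr("LENS"|NOT) = 0.
print(f"detection: {info_gain(prior, 1.0, 0.0):.1f} bits")
# Perfect classifier says "NOT": Pr("NOT"|LENS) = 0, Pr("NOT"|NOT) = 1.
print(f"rejection: {info_gain(prior, 0.0, 1.0):.4f} bits")
# Random classifier: both elements equal, so nothing is learned.
print(f"random:    {info_gain(prior, 0.5, 0.5):.1f} bits")
```

The first two calls correspond to the 12.3 and 0.0003 bit figures quoted in the text, and the third shows why a classifier with equal matrix elements leaves the subject probability unchanged.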

Equation |A2| suggests a useful information theoretical 
definition of the classifier skill perceived by the agent. At a 
fixed value of $p$, we can take the expectation value of the
information gain $\Delta I$ over the possible classifications that
could be made:

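\begin{align}
\langle \Delta I \rangle &= \sum_{C} \sum_{T} \Pr(T|C)\,\Pr(C)\,\log_2 \frac{\Pr(T|C)}{\Pr(T)} \notag \\
&= -\sum_{T} \Pr(T)\,\log_2 \Pr(T) \notag \\
&\quad + \sum_{C} \sum_{T} \Pr(T|C)\,\Pr(C)\,\log_2 \Pr(T|C) \notag \\
&= p\,[S(M_{LL}) + S(1 - M_{LL})] \notag \\
&\quad + (1 - p)\,[S(M_{NN}) + S(1 - M_{NN})] \notag \\
&\quad - S[p M_{LL} + (1 - p)(1 - M_{NN})] \notag \\
&\quad - S[p (1 - M_{LL}) + (1 - p) M_{NN}], \tag{A3}
\end{align}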


where $S(x) = x \log_2 x$. If we choose to evaluate $\langle \Delta I \rangle$ at
$p = 0.5$, the result has some pleasing properties. While
random classifiers presented with $p = 0.5$ subjects have
$\langle \Delta I \rangle_{0.5} = 0.0$ as expected, perfect classifiers appear to the
agents to have $\langle \Delta I \rangle_{0.5} = 1.0$. This suggests that $\langle \Delta I \rangle_{0.5}$, the
amount of information we expect to gain when a classifier
is presented with a 50-50 subject, is a reasonable
quantification of normalised skill. A consequence of this choice is
that the integrated skill (over all agents’ histories) should
come out to be approximately equal to the number of
subjects in the survey, when the search is “complete” (and all
subjects are fully classified). Therefore, a particular agent’s
integrated skill is a reasonable measure of that classifier’s
contribution to the lens search.
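
As a hedged illustration of this skill definition, the sketch below evaluates the final form of Equation A3 at p = 0.5. The names `S` and `skill` are ours, and the imperfect classifier's matrix elements (0.9 and 0.8) are hypothetical values chosen only to show an intermediate case.

```python
import math


def S(x: float) -> float:
    """S(x) = x log2 x, with the limiting value S(0) = 0."""
    return x * math.log2(x) if x > 0.0 else 0.0


def skill(m_ll: float, m_nn: float, p: float = 0.5) -> float:
    """Expected information gain per classification (Equation A3), in bits.

    m_ll : Pr("LENS" | LENS), m_nn : Pr("NOT" | NOT), p : lens probability.
    """
    return (
        p * (S(m_ll) + S(1.0 - m_ll))
        + (1.0 - p) * (S(m_nn) + S(1.0 - m_nn))
        - S(p * m_ll + (1.0 - p) * (1.0 - m_nn))
        - S(p * (1.0 - m_ll) + (1.0 - p) * m_nn)
    )


print(skill(0.5, 0.5))  # random classifier
print(skill(1.0, 1.0))  # perfect classifier
print(skill(0.9, 0.8))  # hypothetical imperfect classifier
```

Summing this quantity over the subjects an agent has classified, using the agent's confusion matrix at the time of each classification, would give the integrated skill described above.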

We conservatively initialize both elements of each
agent’s confusion matrix to be $M_{LL} = M_{NN} = 0.5$, that of
a maximally ambivalent random classifier, so that all agents
start with zero skill. While this makes no allowance for
volunteers that actually do have previous experience of what
gravitational lenses look like, we might expect it to help
mitigate false positives. Anyone who classifies more
than one image (by progressing beyond the tutorial) makes
a non-zero information contribution to the project.

The total information generated during the CFHTLS 
project is shown in Table Interpreting these numbers is 
not easy, but we might do the following. Dividing this by the 
amount of information it takes to classify a Space Warps 
subject all the way to the detection threshold (lens
probability 0.95), and then multiplying by the survey inefficiency
gives us a very rough estimate for the effective number of 
detections corresponding to the crowd’s contribution: these 
are 2830 and 25 bits for Stage 1 and Stage 2 respectively. 
These figures are close to the numbers of detections given in 
column 7 of the table. 
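
One way to estimate the information needed to reach the detection threshold is to evaluate the relative entropy of Equation A1 between the prior and a posterior of 0.95. This is a sketch under our own assumptions: the function name `bits_to_reach` is hypothetical, the prior of 2 x 10^-4 is the value used earlier in this appendix, and whether this matches the exact quantity used for the estimate above is not confirmed by the text.

```python
import math


def bits_to_reach(posterior: float, prior: float) -> float:
    """Relative entropy (bits) between a subject's posterior and prior lens probability."""
    total = 0.0
    # Sum over the two possible truths, LENS and NOT.
    for post, pri in ((posterior, prior), (1.0 - posterior, 1.0 - prior)):
        if post > 0.0:  # 0 * log2(0) is taken as 0
            total += post * math.log2(post / pri)
    return total


bits_per_detection = bits_to_reach(0.95, 2e-4)
print(f"{bits_per_detection:.2f} bits to reach the detection threshold")
```

Dividing a total information budget by this figure then gives a rough count of threshold-crossing detections, before any efficiency correction is applied.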

This paper has been typeset from a TeX/LaTeX file prepared
by the author.